Abstract

This paper investigates the two-step estimation of a high dimensional additive 
regression model, in which the number of nonparametric additive components 
is potentially larger than the sample size but the number of significant additive 
components is sufficiently small. The approach investigated consists of two steps. 
The first step implements the variable selection, typically by the group Lasso, 
and the second step applies the penalized least squares estimation with Sobolev 
penalties to the selected additive components. Such a procedure is computation- 
ally simple to implement and, in our numerical experiments, works reasonably 
well. Despite its intuitive nature, the theoretical properties of this two-step pro- 
cedure have to be carefully analyzed, since the effect of the first step variable 
selection is random, and generally it may contain redundant additive components 
and at the same time miss significant additive components. This paper derives 
a generic performance bound on the two-step estimation procedure allowing for 
these situations, and studies in detail the overall performance when the first step 
variable selection is implemented by the group Lasso. 
AMS2010 subject classifications: 62G05, 62J99 
Key words: additive model, group Lasso, penalized least squares. 

1 Introduction 

In this paper, we are interested in estimating the nonparametric additive regression 
model 

$$y_i = c^* + g^*(z_i) + u_i, \quad g^*(z) = g_1^*(z_1) + \cdots + g_d^*(z_d), \quad E[u_i \mid z_i] = 0, \tag{1.1}$$

where $y_i$ is a dependent variable and $z_i = (z_{i1}, \dots, z_{id})'$ is a vector of $d$ explanatory
variables. Throughout the paper, we assume that the observations are independent
and identically distributed (i.i.d.). We presume the situation in which $d$ is larger than
the sample size $n$ (in fact, we allow $d$ to be of non-polynomial order in $n$), but most
of $g_1^*, \dots, g_d^*$ are zero functions. So the model of interest is a high dimensional sparse
additive model. Let $\mathcal{Z}$ denote the support of $z_i$. Without losing much generality, we
assume that $\mathcal{Z} = [0,1]^d$. The unknown additive components $g_1^*, \dots, g_d^*$ are known to
belong to a given class $\mathcal{G}$ of functions on $[0,1]$. Throughout the paper, we consider the
case in which $\mathcal{G}$ is a Sobolev class: $\mathcal{G} = W_2^\nu([0,1])$, where $\nu$ is a positive integer and

$$W_2^\nu([0,1]) := \Big\{ g : [0,1] \to \mathbb{R} : g^{(\nu-1)} \text{ is absolutely continuous and } \int_0^1 g^{(\nu)}(z)^2 \, dz < \infty \Big\}.$$

For identification of $g_1^*, \dots, g_d^*$, we assume that

$$E[g_j^*(z_{ij})] = 0, \quad 1 \le \forall j \le d.$$

Let $T^* := \{ j \in \{1, \dots, d\} : E[g_j^*(z_{ij})^2] \neq 0 \}$, the index set of nonzero components, and
$s^* := |T^*|$, the number of nonzero components. It is assumed that $s^*$ is smaller than
$n$.

There has been a growing interest in estimation of high dimensional sparse additive 
models (Lin and Zhang, 2006; Ravikumar et al., 2009; Meier et al., 2009; Huang et 
al., 2010; Koltchinskii and Yuan, 2010; Raskutti et al., 2010; Suzuki et al., 2011; Fan 
et al., 2011; Buhlmann and van de Geer, 2011). Parallel to parametric regression 
models, sparsity of the underlying structure makes it possible to estimate consistently 
the parameter of interest (in this case, the conditional mean function) even when d is 
larger than n. Estimation accuracy is not the sole goal. In fact, it may happen that,
despite the underlying sparsity structure, an estimator containing many redundant
components has a good estimation accuracy. However, for the sake of interpretability,
one wishes to have a concise model. Therefore, the goal is to obtain an estimator that is
(i) appropriately sparse, in the sense that it does not contain many redundant additive 
components, and at the same time (ii) possesses a good estimation accuracy. 

A distinctive feature of the present nonparametric estimation, when compared with
the parametric case, is that the function class $\mathcal{G}$ is much more complex, which brings a
new challenge. To address this problem, in a fundamental paper, Meier et al. (2009)
proposed a penalized least squares estimation method with the novel penalty term:

$$\lambda_1 \sum_{j=1}^d \sqrt{\|g_j\|_{2,n}^2 + \lambda_2 I(g_j)^2} + \lambda_3 \sum_{j=1}^d I(g_j)^2, \tag{1.2}$$

where

$$\|g_j\|_{2,n}^2 = \frac{1}{n} \sum_{i=1}^n g_j(z_{ij})^2, \quad I(g_j)^2 = \int_0^1 g_j^{(\nu)}(z)^2 \, dz.$$

The term $\|g_j\|_{2,n}$ penalizes the event that $g_j$ enters the model, thereby enforcing
sparsity of the resulting estimator; the term $I(g_j)$ penalizes roughness of $g_j$ and controls
the complexity of the class $\mathcal{G}$, thereby avoiding overfitting and guaranteeing a good
estimation accuracy of the resulting estimator. See Koltchinskii and Yuan (2010);
Raskutti et al. (2010); Suzuki et al. (2011) for further progress. An important
theoretical fact is that, as Suzuki et al. (2011) showed^1, under suitable regularity
conditions such as the uniform boundedness of the error term, the Meier et al. (2009)
estimator achieves the minimax rate of convergence (in the $L_2$-risk) $s^* \delta^2$, where

$$\delta := \max\big\{ n^{-\nu/(2\nu+1)}, \sqrt{\log d / n} \big\}.$$



See Raskutti et al. (2010) and Suzuki et al. (2011) for minimax rates in our problem. 
Thus, from a theoretical point of view, their estimator has a good convergence property. 

However, we would like to point out that the double penalization strategy that 
Meier et al. (2009) used may practically lead to a loss of accuracy in estimation/variable 
selection. In practice, the sparsity penalty brings a shrinkage bias to the selected 
additive components, so the resulting estimator may have a worse performance than an 
oracle estimator, which is an "estimator" constructed as if T* were known, even when 
the correct model selection is achieved. Furthermore, choosing the tuning parameters in 
such a way that the estimation accuracy is optimized would result in including too many 
redundant variables. The problem of shrinkage bias caused by sparsity penalties has 
been recognized in the parametric regression case. In the linear regression case, Belloni 
and Chernozhukov (2011b) considered the two-step estimator of the coefficient vector, 
which corresponds to the least squares estimator applied to the variables selected by 
the Lasso (Tibshirani, 1996), and observed that in their simulation experiments the 
two-step estimator significantly outperforms the Lasso estimator because the former 
can remove a shrinkage bias. Motivated by these observations, we consider the two-step
estimation of high dimensional additive models in which the first step implements
the variable selection, typically by the group Lasso (Yuan and Lin, 2006), and the
second step applies the penalized least squares estimation with Sobolev penalties to
the selected additive components. The paper is devoted to a careful study of the
theoretical and numerical properties of this two-step estimator.

^1 Suzuki et al. (2011) adopted a slightly different formulation than Meier et al. (2009), i.e., in
Suzuki et al. (2011), the Sobolev penalty $I(g_j)$ is replaced by the reproducing kernel Hilbert space
norm. However, essentially the same proof applies to the original Meier et al. (2009) estimator.

The main theoretical contribution of the paper is a generic bound on the $L_2$-risk
of the second step estimator. In a typical situation, the bound reduces to

$$\max\Big\{ s^* n^{-2\nu/(2\nu+1)}, \ |\hat{T} \setminus T^*| \delta^2, \ \Big\| \sum_{j \in T^* \setminus \hat{T}} g_j^* \Big\|_2^2 \Big\}, \tag{1.3}$$

where $\hat{T} \subset \{1, \dots, d\}$ is the index set selected by the first step variable selection.
Importantly, this bound applies to any variable selection method such that, roughly
speaking, the size $|\hat{T}|$ is stochastically not overly large compared with $s^*$, and holds
in both the situations in which (i) $\hat{T}$ may have redundant variables (i.e., $\hat{T} \setminus T^* \neq \emptyset$),
and (ii) $\hat{T}$ may miss significant variables (i.e., $T^* \setminus \hat{T} \neq \emptyset$). This bound has a natural
interpretation. The first term $s^* n^{-2\nu/(2\nu+1)}$ corresponds to the oracle rate, the rate
that could be achieved if $T^*$ were known; the second term $|\hat{T} \setminus T^*| \delta^2$ corresponds to
the effect of selecting redundant variables; the third term $\| \sum_{j \in T^* \setminus \hat{T}} g_j^* \|_2^2$ corresponds
to the effect of missing significant variables.
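
To make the three terms of the bound (1.3) concrete, the following is a minimal numerical sketch that evaluates them up to constants. The function names and the example inputs (three redundant components, no missed component, $\nu = 2$) are hypothetical and chosen only for illustration; the rate $\delta = \max\{n^{-\nu/(2\nu+1)}, \sqrt{\log d / n}\}$ is the one defined in the text.

```python
import math


def delta(n, d, nu):
    """delta = max{ n^(-nu/(2nu+1)), sqrt(log d / n) }, as defined in the text."""
    return max(n ** (-nu / (2 * nu + 1)), math.sqrt(math.log(d) / n))


def bound_terms(n, d, nu, s_star, n_redundant, missed_norm_sq):
    """Return the three terms inside the max of (1.3), ignoring constants.

    n_redundant    : size of the set of selected but irrelevant components
    missed_norm_sq : squared L2 norm of the sum of missed components
    (both are hypothetical inputs; in practice they are unknown)
    """
    oracle = s_star * n ** (-2 * nu / (2 * nu + 1))
    redundant = n_redundant * delta(n, d, nu) ** 2
    return oracle, redundant, missed_norm_sq


# Illustrative values only: n and d as in the simulation design, nu = 2 assumed.
terms = bound_terms(n=400, d=1000, nu=2, s_star=4, n_redundant=3, missed_norm_sq=0.0)
print(terms, max(terms))
```

With these illustrative inputs, $\sqrt{\log d / n}$ dominates $n^{-\nu/(2\nu+1)}$ in $\delta$, so a handful of redundant components already costs more than the oracle term.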

One may wonder whether it is plausible to presume the perfect model selection (i.e.,
$\hat{T} = T^*$ with probability approaching one), in which case the analysis becomes trivial,
since one may guarantee the perfect model selection by applying a hard thresholding
method to the first step group Lasso or using the adaptive group Lasso. However, what
we need is a bound that applies to a general situation in which $\hat{T}$ may fail to recover
$T^*$. In view of the literature, guaranteeing the perfect model selection requires a side
condition that the non-zero additive components are well separated from zero (in the
$L_2$-sense), which is considerably restrictive from a theoretical point of view. In fact,
under the side condition, the exact oracle rate $s^* n^{-2\nu/(2\nu+1)}$ will be achievable, and in
view of the minimax rate, this means that the complexity of the problem is significantly
reduced.^2 Therefore, to make a meaningful comparison with existing estimators such
as the Meier et al. (2009) estimator, one has to establish a performance bound without
presuming the perfect model selection. An important aspect of the bound (1.3) is
that it characterizes the effect of the first step variable selection in an explicit manner,
which makes the analysis non-trivial. Another interesting finding is that, despite the
random fluctuation of $\hat{T}$, the smoothing penalty level in the second step can be taken
independent of $d$.

^2 $\sqrt{\log d / n}$ can be dominant in $\delta$ as long as $\log d / n^{1/(2\nu+1)} \to \infty$. Our analysis allows for this case.

In this paper, we primarily focus on using the group Lasso as a first step variable
selection method. The side (and hence not main) contribution of the paper is to
establish (some) refined asymptotic results on the statistical properties of the group
Lasso estimator for high dimensional additive models, which complements the recent
literature on the theoretical study of the group Lasso (Nardi and Rinaldo, 2008; Bach,
2008; Wei and Huang, 2010; Huang and Zhang, 2010; Huang et al., 2010; Lounici et al.,
2011; Negahban et al., 2010). The group Lasso, when applied to estimation of additive
models, is based on a different idea of dealing with the complexity of function classes,
i.e., approximating each function class by a finite dimensional class of functions and
controlling the complexity by its dimension.^3 Expanding each additive component by
a linear combination of given basis functions, selection of additive components reduces
to selection of groups of the coefficient vector of the basis expansion, so that the group
Lasso turns out to be an effective way of selecting additive components. Combined
with the bound (1.3), when the group Lasso is used as a first step procedure, it will be
seen that (under suitable regularity conditions, of course) (i) the two-step estimator is
at least as good as the Meier et al. (2009) estimator, meaning that it achieves the rate
$s^* \delta^2$ in general cases in which $\hat{T}$ may fail to recover $T^*$ (so $\hat{T} \setminus T^* \neq \emptyset$ or $T^* \setminus \hat{T} \neq \emptyset$,
or both); (ii) if it happens that the perfect model selection holds, then the two-step
estimator enjoys the exact oracle rate $s^* n^{-2\nu/(2\nu+1)}$. The second step estimation can
automatically adapt to both situations (i) and (ii), i.e., adapt to the model selection
ability of $\hat{T}$. We believe that these theoretical results in the context of estimation of
high dimensional additive models are useful.

We also carry out simulation experiments to investigate the finite sample property 
of the two-step estimator. The simulation results suggest that the proposed two-step 
estimator is a good alternative in estimating high dimensional additive models. 

There are a large number of works on the theoretical analysis of penalized estimation
methods for high dimensional sparse models, especially on the Lasso for linear
regression models (Bunea et al., 2007a,b; Zhao and Yu, 2007; Zhang and Huang, 2008;
Meinshausen and Yu, 2009; Wainwright, 2009; Candès and Plan, 2009; Bickel et al.,
2009; Zhang, 2009), generalized linear models (van de Geer, 2008; Negahban et al.,
2010) and quantile regression models (Belloni and Chernozhukov, 2011a). See also
Buhlmann and van de Geer (2011) for a recent review. In the quantile regression context,
Belloni and Chernozhukov (2011a) formally established the theoretical properties
of the post-penalized estimator that corresponds to the unpenalized quantile regression
estimator applied to the variables selected by the $\ell_1$-penalized estimator. Their
analysis is extended to the linear regression case in Belloni and Chernozhukov (2011b).
As noted before, the present paper builds on these fundamental papers, but has two
important theoretical departures from the previous analysis: (i) the model of interest
is a nonparametric additive model, and (ii) the second step estimation has smoothness
penalty terms.

^3 See Chapter 10 of van de Geer (2000) for two different ideas, namely, penalty and sieve approaches,
to deal with the complexity of function classes in nonparametric regression.

The remainder of the paper is organized as follows. Section 2 describes the two- 
step estimation method. Section 3 presents some simulation experiments. Section 4 is 
devoted to the theoretical study. Section 5 concludes. Section 6 provides a proof of 
Theorem 4.1. Some other technical proofs are gathered in Appendices. 

Notation: In the theoretical study, we rely on the asymptotic scheme in which $d$
and $s^*$ may diverge with the sample size $n$. Hence we agree that all parameter values
(such as $d, s^*, \dots$) are indexed by $n$ and the limit is always taken as $n \to \infty$, but we omit
the index $n$ in most cases. For two sequences $a = a(n)$ and $b = b(n)$, we use the
notation $a \lesssim b$ if there exists a positive constant $C$ independent of $n$ such that $a \le Cb$,
$a \asymp b$ if $a \lesssim b$ and $b \lesssim a$, and $a \lesssim_p b$ if $a = O_p(b)$. Let $S^{l-1}$ denote the unit sphere in
$\mathbb{R}^l$ for a positive integer $l$. Let $I_l$ denote the $l \times l$ identity matrix. We use $\|\cdot\|_E$ to
indicate the Euclidean norm, and let $\|\cdot\|_\infty$ denote the supremum norm. For a matrix
$A$, let $\|A\|$ denote the operator norm of $A$. For a symmetric positive semidefinite
matrix $A$, let $A^{1/2}$ denote the symmetric square root matrix of $A$. Let $\|\cdot\|_{2,n}$ and
$\|\cdot\|_2$ denote the empirical and population $L_2$ norms with respect to the $z_i$'s respectively,
i.e., for $g : \mathcal{Z} \to \mathbb{R}$,

$$\|g\|_{2,n}^2 := \frac{1}{n} \sum_{i=1}^n g(z_i)^2, \quad \|g\|_2^2 := E[g(z_1)^2].$$

To make the notation simpler, if we write the index $j$ in $g_j : [0,1] \to \mathbb{R}$, we agree that

$$\|g_j\|_{2,n}^2 = \frac{1}{n} \sum_{i=1}^n g_j(z_{ij})^2, \quad \|g_j\|_2^2 = E[g_j(z_{1j})^2].$$

2 Two-step estimation 

This section describes the proposed estimation method. 

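First step: Use an appropriate variable selection method to determine a subset $\hat{T}$
of $\{1, \dots, d\}$.

Second step: Apply a penalized least squares method with roughness penalties
to the selected additive components:

$$(\tilde{c}, \tilde{g}_j, j \in \hat{T}) := \arg\min_{c \in \mathbb{R},\, g_j \in \mathcal{G},\, j \in \hat{T}} \Big[ \frac{1}{2n} \sum_{i=1}^n \Big\{ y_i - c - \sum_{j \in \hat{T}} g_j(z_{ij}) \Big\}^2 + \sum_{j \in \hat{T}} \lambda_{2,j}^2 I(g_j)^2 \Big], \tag{2.1}$$

subject to the restrictions $\sum_{i=1}^n g_j(z_{ij}) = 0$, $\forall j \in \hat{T}$, where $\lambda_{2,j} \ge 0$ are smoothing
parameters and the term $I(\cdot)$ is the Sobolev penalty:

$$I(g_j)^2 := \int_0^1 g_j^{(\nu)}(z)^2 \, dz.$$

The resulting estimator of $g^*$ is given by $\tilde{g}(z) := \sum_{j \in \hat{T}} \tilde{g}_j(z_j)$. In the theoretical study,
to make the argument simple, we let $\lambda_{2,j} = \lambda_2$ for all $j \in \hat{T}$.
It will be shown that $\lambda_2 \asymp n^{-\nu/(2\nu+1)}$ gives a correct choice.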

There are several possible choices for the first step variable selection method. We
primarily focus on the group Lasso.

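Group Lasso: Suppose that we have a set of basis functions $\{\psi_1, \dots, \psi_m\}$ on $[0,1]$
(except for the constant function). The number $m = m_n$ should be taken such that
$m \to \infty$ as $n \to \infty$ but $m = o(n)$. It will be shown that $m \asymp n^{1/(2\nu+1)}$ gives an
optimal choice. We estimate each additive component by a linear combination of basis
functions. Let $\mathcal{G}_m := \{ g : [0,1] \to \mathbb{R} : g(\cdot) = c + \sum_{k=1}^m \beta_k \psi_k(\cdot),\ c \in \mathbb{R},\ \beta_k \in \mathbb{R},\ 1 \le k \le m \}$.
We consider the estimator:

$$(\hat{c}, \hat{g}_1, \dots, \hat{g}_d) := \arg\min_{c \in \mathbb{R},\, g_j \in \mathcal{G}_m,\, 1 \le j \le d} \Big[ \frac{1}{2n} \sum_{i=1}^n \Big\{ y_i - c - \sum_{j=1}^d g_j(z_{ij}) \Big\}^2 + \lambda_1 \sum_{j=1}^d \|g_j\|_{2,n} \Big], \tag{2.2}$$

subject to the restrictions $\sum_{i=1}^n g_j(z_{ij}) = 0$, $1 \le \forall j \le d$, where $\lambda_1$ is a nonnegative tuning
parameter that controls sparsity of the resulting estimator. The resulting estimator
of $g^*$ is given by $\hat{g}(z) := \sum_{j=1}^d \hat{g}_j(z_j)$. It will be shown that

$$\lambda_1 \asymp \max\Big\{ \sqrt{\frac{m}{n}}, \sqrt{\frac{\log d}{n}} \Big\}$$

gives a correct choice. Let $\hat{T}^0 := \{ j \in \{1, \dots, d\} : \|\hat{g}_j\|_{2,n} > 0 \}$.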

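It is more convenient to concentrate out the constant term when analyzing the
estimator $\hat{g}$. Define $x_{iG_j} := (\psi_1(z_{ij}), \dots, \psi_m(z_{ij}))'$, $\tilde{x}_{iG_j} := x_{iG_j} - \bar{x}_{G_j}$ ($\bar{x}_{G_j} :=
n^{-1} \sum_{i=1}^n x_{iG_j}$) for $1 \le j \le d$, and $\tilde{x}_i := (\tilde{x}_{iG_1}', \tilde{x}_{iG_2}', \dots, \tilde{x}_{iG_d}')'$. Let $\hat{\Sigma}_j := n^{-1} \sum_{i=1}^n \tilde{x}_{iG_j} \tilde{x}_{iG_j}'$
for $1 \le j \le d$ and $\hat{\Sigma} := n^{-1} \sum_{i=1}^n \tilde{x}_i \tilde{x}_i'$. For $\beta = (\beta_{11}, \dots, \beta_{1m}, \dots, \beta_{d1}, \dots, \beta_{dm})' \in \mathbb{R}^{dm}$,
we use the notation $\beta_{G_j} = (\beta_{j1}, \dots, \beta_{jm})'$ for $1 \le j \le d$. Working with this notation,
it is seen that $\hat{c} = n^{-1} \sum_{i=1}^n (y_i - \sum_{j=1}^d \hat{g}_j(z_{ij})) = \bar{y} := n^{-1} \sum_{i=1}^n y_i$ and
$\hat{g}_j(z_j) = \sum_{k=1}^m \hat{\beta}_{jk} (\psi_k(z_j) - \bar{\psi}_{jk})$ ($\bar{\psi}_{jk} := n^{-1} \sum_{i=1}^n \psi_k(z_{ij})$; $1 \le j \le d$, $1 \le k \le m$),
where

$$\hat{\beta} := \arg\min_{\beta \in \mathbb{R}^{dm}} \Big[ \frac{1}{2n} \sum_{i=1}^n (y_i - \bar{y} - \tilde{x}_i' \beta)^2 + \lambda_1 \sum_{j=1}^d \|\hat{\Sigma}_j^{1/2} \beta_{G_j}\|_E \Big]. \tag{2.3}$$

Here the penalty takes this form because $\|g_j\|_{2,n}^2 = \beta_{G_j}' \hat{\Sigma}_j \beta_{G_j}$ for a centered $g_j \in \mathcal{G}_m$.
Therefore, the estimator $\hat{g}$ is computed by solving the group Lasso problem (2.3),
and we call $\hat{g}$ the group Lasso estimator. The group Lasso estimator is known to be
groupwise sparse. In the present context, this means that some of the additive components
are estimated as zero functions.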
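
The groupwise sparsity mentioned above can be seen in the simplest special case. The sketch below is not the algorithm used in the paper (which relies on the grplasso package); it only shows the standard closed-form minimizer of $\frac{1}{2}\|b - \beta\|_E^2 + \lambda \|\beta\|_E$ for a single coefficient group, under the simplifying assumption of an orthonormalized design. The function name and inputs are hypothetical.

```python
import math


def group_soft_threshold(b, lam):
    """Minimizer of 0.5 * ||b - beta||^2 + lam * ||beta|| (Euclidean norms).

    If the whole group b is weak (||b|| <= lam), every coefficient in the
    group is set to zero at once; otherwise the group is shrunk toward zero
    by a common factor.
    """
    norm = math.sqrt(sum(x * x for x in b))
    if norm <= lam:
        return [0.0] * len(b)
    scale = 1.0 - lam / norm
    return [scale * x for x in b]


# A weak group is removed entirely; a strong group is kept but shrunk.
weak = group_soft_threshold([0.1, -0.2, 0.1], lam=0.5)
strong = group_soft_threshold([3.0, 4.0], lam=1.0)
print(weak)
print(strong)
```

The common shrinkage factor applied to a retained group is one way to see the shrinkage bias that motivates refitting in the second step.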
Some comments are in order. 

Remark 2.1. In the group Lasso, there is no need to use common basis functions for
all $j$; i.e., we may use different basis functions for different $j$. To make the notation
simpler (e.g. to avoid the extra index $j$ in $\psi_1, \dots, \psi_m$ and $\mathcal{G}_m$ etc.), we write the
group Lasso procedure as it is.

Remark 2.2 (Computation). Because the proposed method is a combination of two
commonly used methods, it can be implemented by using standard statistical software 
packages. In this sense, implementation of the proposed method is simple. 

Remark 2.3 (Other options for the first step procedure). Although we primarily
focus on the group Lasso for the first step variable selection, it is possible to use other
variable selection methods available in the literature. For instance, the nonparametric
independence screening (NIS) method proposed in Fan et al. (2011) is known to be
a computationally effective way of screening variables. However, in view of (1.3), to
obtain a performance bound on the second step estimator, a suitable bound on the
magnitude of missed components $\| \sum_{j \in T^* \setminus \hat{T}} g_j^* \|_2$ is required, and at the moment it is
not known whether NIS ensures a reasonable bound on it. A preferable feature of the
group Lasso is that under certain regularity conditions it gives reasonable bounds on
both $|\hat{T}^0 \setminus T^*|$ and $\| \sum_{j \in T^* \setminus \hat{T}^0} g_j^* \|_2$ (see Section 4).

Remark 2.4 (Other options for the second step procedure). There is an alternative
second step estimator of $g^*$, namely, the sieve least squares estimator applied to the
selected additive components:

$$(\check{c}, \check{g}_j, j \in \hat{T}) := \arg\min_{c \in \mathbb{R},\, g_j \in \mathcal{G}_m,\, j \in \hat{T}} \frac{1}{2n} \sum_{i=1}^n \Big\{ y_i - c - \sum_{j \in \hat{T}} g_j(z_{ij}) \Big\}^2, \tag{2.4}$$

subject to the restrictions $\sum_{i=1}^n g_j(z_{ij}) = 0$, $\forall j \in \hat{T}$ (note: Remark 2.1 applies to this
case). Let $\check{g} := \sum_{j \in \hat{T}} \check{g}_j$. It is expected that a similar conclusion (to $\tilde{g}$) holds for this
estimator. In terms of estimation accuracy, it is difficult to judge which is theoretically
better. To keep the paper focused, we restrict our attention to $\tilde{g}$ and do not make a
formal study of $\check{g}$, but compare their finite sample performance by simulations. In
our limited simulation experiments, $\tilde{g}$ outperforms $\check{g}$ (see Table 1 ahead), which is a
(partial) motivation for studying $\tilde{g}$.

3 Simulation experiments 

This section reports simulation experiments that evaluate the finite sample performance
of the estimators. The estimators under consideration are the group Lasso (GL)
estimator defined by (2.2), the sieve least squares estimator applied to the variables
selected by the group Lasso (called GL-SL estimator) defined by (2.4) with $\hat{T} = \hat{T}^0$,
the penalized least squares estimator applied to the variables selected by the group
Lasso (called GL-PL estimator) defined by (2.1) with $\hat{T} = \hat{T}^0$, the penalized least
squares estimator with known true support (called ORACLE estimator) defined by
(2.1) with $\hat{T} = T^*$, and the Meier et al. (2009) estimator (called MGB estimator). The
MGB estimator is defined as a minimizer of the least squares criterion function subject
to the penalty (1.2) with $\lambda_3 = 0$. The choice $\lambda_3 = 0$ is not theoretically optimal, but
what Meier et al. (2009) actually proposed in practice is this estimator, so in these
experiments we take $\lambda_3 = 0$.

To implement the group Lasso, we have to determine basis functions and the penalty
level $\lambda_1$. We use cubic B-splines with four evenly distributed internal knots (so $m = 7$).
To choose the penalty level, we use an AIC type criterion. Let $\hat{g}_{\lambda_1}$ denote the
GL estimator of $g^*$ with penalty level $\lambda_1$. We choose the optimal penalty level that
minimizes the criterion

$$\mathrm{AIC}(\lambda_1) = n \log\Big( \sum_{i=1}^n (y_i - \bar{y} - \hat{g}_{\lambda_1}(z_i))^2 / n \Big) + 2m |\hat{T}^0_{\lambda_1}|,$$

where $\hat{T}^0_{\lambda_1} := \{ j \in \{1, \dots, d\} : \|\hat{g}_{j,\lambda_1}\|_{2,n} \neq 0 \}$. Certainly there are other options to
choose the penalty level $\lambda_1$, such as cross validation. Here we use the AIC because
of its intuitive nature and since it is simple to implement. To compute GL estimates, we
use the package grplasso in R. To compute GL-PL estimates and ORACLE estimates, 
we use the package mgcv in R in which the smoothing parameters are automatically 
optimized according to GCV (by default). See Wood (2006). Comparison with the 
MGB estimator is not a standard task since its performance depends on multiple
tuning parameters. To guarantee a fair comparison, based on a preliminary simulation
work, we prepared a set of candidate values for $(\lambda_1, \lambda_2)$ and evaluated the
performance of the MGB estimator for each $(\lambda_1, \lambda_2)$. The set of candidate values is
given by

$$\{ (\lambda_1, \lambda_2) : \lambda_1 = \bar{\lambda}_1 \times \lambda_{\max}/n,\ \bar{\lambda}_1 \in \{0.12, 0.08, 0.04, 0.02\},\ \lambda_2 \in \{0.05, 0.02, 0.01, 0.005\} \},$$

where $\lambda_{\max}$ is computed by the lambdamax option in the grplasso package when the
minimization problem is transformed to the group Lasso problem.
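
The AIC type rule for the group Lasso penalty level described above can be sketched as follows. This is only an illustration of the criterion $n \log(\mathrm{RSS}/n) + 2m|\hat{T}^0|$; the function names and the toy fits (residual vectors and numbers of selected components for three hypothetical penalty levels) are assumptions made for the example, not output from grplasso.

```python
import math


def aic(residuals, m, n_selected):
    """AIC type criterion: n * log(RSS / n) + 2 * m * (number of selected components)."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * m * n_selected


def select_penalty(fits, m):
    """fits: list of (penalty_level, residuals, n_selected); return the AIC-minimizing level."""
    return min(fits, key=lambda fit: aic(fit[1], m, fit[2]))[0]


# Toy fits (hypothetical numbers): a larger penalty selects fewer components
# but leaves larger residuals.
fits = [
    (0.10, [1.0] * 100, 0),
    (0.05, [0.5] * 100, 2),
    (0.01, [0.49] * 100, 10),
]
best = select_penalty(fits, m=7)
print(best)
```

Each selected component costs $2m$ in the criterion because it brings $m$ basis coefficients, which is why the smallest penalty level is not chosen in the toy example even though it has the smallest residuals.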

Each estimator is evaluated by the empirical mean square error (EMSE). Let $\mu_i :=
c^* + g^*(z_i)$ and for a generic estimator $(\hat{c}, \hat{g})$ of $(c^*, g^*)$, let $\hat{\mu}_i := \hat{c} + \hat{g}(z_i)$. Then, the
EMSE is defined as

$$\mathrm{EMSE} := E\Big[ n^{-1} \sum_{i=1}^n (\hat{\mu}_i - \mu_i)^2 \Big].$$

For GL and MGB estimators, we compute the average numbers of variables
selected (NV), false positives (FP) and false negatives (FN).
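
As a small illustration of these evaluation measures, the sketch below computes NV, FP and FN for one selected index set and the inner average $n^{-1} \sum_i (\hat{\mu}_i - \mu_i)^2$ for one sample; the EMSE is then the average of that quantity over Monte Carlo repetitions. The function names and the toy inputs are hypothetical.

```python
def selection_counts(selected, true_support):
    """Return (NV, FP, FN) for one selected index set."""
    selected, true_support = set(selected), set(true_support)
    nv = len(selected)                 # number of variables selected
    fp = len(selected - true_support)  # selected but truly zero components
    fn = len(true_support - selected)  # nonzero components that were missed
    return nv, fp, fn


def empirical_mse(mu_hat, mu):
    """n^{-1} * sum_i (mu_hat_i - mu_i)^2 for a single sample."""
    return sum((a - b) ** 2 for a, b in zip(mu_hat, mu)) / len(mu)


# Toy example: true support {1, 2, 3, 4}; the selection misses 4 and adds 7 and 9.
print(selection_counts([1, 2, 3, 7, 9], [1, 2, 3, 4]))
print(empirical_mse([1.0, 2.0], [0.0, 2.0]))
```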

The number of Monte Carlo repetitions is 500. We consider the case where $n =
400$ and $d = 1{,}000$. The explanatory variables $z_i = (z_{i1}, \dots, z_{id})'$ are generated as:
$z_{ij} = (w_{ij} + t v_i)/(1 + t)$ for $j = 1, \dots, d$, where $v_i, w_{i1}, \dots, w_{id}$ are i.i.d. uniform
random variables on $[0,1]$. The parameter $t$ controls correlation between variables,
i.e., a larger $t$ implies a larger correlation. Three cases, $t = 0$, $0.5$ and $1$, are considered.
In what follows, let $g_1(z) = z$, $g_2(z) = (2z - 1)^2$, $g_3(z) = \sin(2\pi z)/(2 - \sin(2\pi z))$ and
$g_4(z) = 0.1 \sin(2\pi z) + 0.2 \cos(2\pi z) + 0.3 \sin^2(2\pi z) + 0.4 \cos^3(2\pi z) + 0.5 \sin^3(2\pi z)$. We
consider two models.

Model 1: $y_i = 5 g_1(z_{i1}) + 3 g_2(z_{i2}) + 4 g_3(z_{i3}) + 6 g_4(z_{i4}) + \sqrt{1.74}\, \epsilon_i$, $\epsilon_i \sim N(0, 1)$.

Model 2: $y_i = 3.5 g_1(z_{i1}) + 2.1 g_2(z_{i2}) + 2.8 g_3(z_{i3}) + 4.2 g_4(z_{i4}) + 3.5 g_1(z_{i5}) + 2.1 g_2(z_{i6}) +
2.8 g_3(z_{i7}) + 4.2 g_4(z_{i8}) + \sqrt{1.74}\, \epsilon_i$, $\epsilon_i \sim N(0, 1)$.

The coefficients in model 2 are adjusted in such a way that the variance of the
conditional mean of $y_i$ given $z_i$ is roughly the same as in model 1. These designs are
essentially adapted from Meier et al. (2009).
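
The data generating process of Model 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the coefficient 3 on $g_2$ and the noise scale $\sqrt{1.74}$ are read from the model description above, and a small sample (20 observations, 10 covariates) is drawn rather than the full design with $n = 400$ and $d = 1{,}000$.

```python
import math
import random


def g1(z):
    return z


def g2(z):
    return (2 * z - 1) ** 2


def g3(z):
    s = math.sin(2 * math.pi * z)
    return s / (2 - s)


def g4(z):
    s = math.sin(2 * math.pi * z)
    c = math.cos(2 * math.pi * z)
    return 0.1 * s + 0.2 * c + 0.3 * s ** 2 + 0.4 * c ** 3 + 0.5 * s ** 3


def draw_covariates(n, d, t, rng):
    """Each covariate mixes its own uniform draw with a uniform draw common to the row."""
    rows = []
    for _ in range(n):
        common = rng.random()  # shared factor inducing correlation when t > 0
        rows.append([(rng.random() + t * common) / (1 + t) for _ in range(d)])
    return rows


def model1_response(z_row, rng):
    """Model 1: four nonzero components plus Gaussian noise with variance 1.74."""
    mean = 5 * g1(z_row[0]) + 3 * g2(z_row[1]) + 4 * g3(z_row[2]) + 6 * g4(z_row[3])
    return mean + math.sqrt(1.74) * rng.gauss(0.0, 1.0)


rng = random.Random(0)
z = draw_covariates(n=20, d=10, t=0.5, rng=rng)
y = [model1_response(row, rng) for row in z]
print(len(z), len(z[0]), len(y))
```

Because each covariate is a convex combination of two uniform draws, it stays in $[0,1]$ for every $t \ge 0$, in line with the assumed support.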

The simulation results are given in Tables 1 and 2. Table 2 shows the performance 
of the MGB estimator with the tuning parameters chosen in such a way that the 
EMSE is minimized, and hence the EMSEs in Table 2 should be understood as the 
ideal EMSEs of the MGB estimator. Overall the MGB estimator, with the tuning 
parameters chosen in such a way that the EMSE is minimized, includes too many
redundant variables. This feature is consistent with the simulation study in Fan et al.
(2011). 

In model 1, in which the number of nonzero additive components is small (s* = 4) 
and each nonzero additive component has a relatively large signal, the variable selection 
by the group Lasso works well, and hence the GL-PL estimator performs strictly better 
than the ideal MGB estimator in the EMSE in all cases. 

In model 2, in which the number of nonzero additive components is large (s* = 8) 
and each nonzero additive component has a relatively small signal (compared with 
model 1), the performance of the GL-PL estimator deteriorates, especially when $t = 1$. When
$t = 1$, that is, the correlation among the components of $z_i$ is high, it is difficult to detect the nonzero
additive components correctly, and the group Lasso on average does not work very well
and the EMSE of the GL-PL estimator is worse than that of the MGB estimator. However, this
better performance of the MGB estimator is at the cost of selecting many redundant
additive components: on average it includes 75 redundant additive components. It
turns out that the performance of the MGB estimator is sensitive to the value of $\lambda_1$
and not to $\lambda_2$. Table 3 shows the performance of the MGB estimator in model 2
with $t = 1$ and $\lambda_2 = 0.05$, and with different values of the multiplier $\bar{\lambda}_1$ in
$\lambda_1 = \bar{\lambda}_1 \times \lambda_{\max}/n$ (the best EMSE among all
candidate $(\lambda_1, \lambda_2)$ in model 2 with $t = 1$ is achieved at $\lambda_1 = 0.02 \times \lambda_{\max}/n$, $\lambda_2 = 0.05$,
which is the reason why we focus on the $\lambda_2 = 0.05$ case). Increasing $\bar{\lambda}_1 = 0.02$ to
$\bar{\lambda}_1 = 0.04$ reduces the number of false positives, on average from 75 to 9, but
worsens the EMSE, from 0.528 to 0.921. Taking this into account, we may see
that the GL-PL estimator works reasonably well.

4 Theoretical study 
4.1 Basic conditions 

In this section, we introduce basic conditions commonly used in the analysis of the 
first and second step estimators. 

(C1) (Restriction on the data generating process) $\{(y_i, z_i')' : i = 1, 2, \dots\}$ are i.i.d.,
where the pair $(y_i, z_i')'$ satisfies the model (1.1).

(C2) (Restriction on the (conditional) distribution of $u_i$) The distribution of $u_i$ is
such that either:

(a) the support of $u_i$ is bounded, or

Table 1: Simulation results

"GL" refers to the group Lasso estimator, "GL-SL" to the group Lasso + sieve
least squares estimator, "GL-PL" to the group Lasso + penalized least squares
estimator, "ORACLE" to the penalized least squares estimator with known true
support, "NV" to the number of selected variables, "FP" to the false positive,
"FN" to the false negative, and "EMSE" refers to the empirical mean square
error. Standard deviations are given in parentheses.



Table 2: Simulation results (continued)

MGB

Case                 NV              FP              FN            EMSE
Model 1 (t = 0)      17.86 (6.08)    13.86 (6.08)    0.00 (0.00)   0.357 (0.068)
Model 1 (t = 0.5)    13.01 (4.85)     9.01 (4.85)    0.00 (0.00)   0.361 (0.069)
Model 1 (t = 1)      70.19 (9.90)    66.19 (9.90)    0.00 (0.00)   0.349 (0.059)
Model 2 (t = 0)      62.53 (11.77)   54.53 (11.77)   0.00 (0.00)   0.553 (0.080)
Model 2 (t = 0.5)    44.33 (11.11)   36.33 (11.11)   0.00 (0.06)   0.532 (0.081)
Model 2 (t = 1)      83.00 (10.14)   75.01 (10.14)   0.01 (0.08)   0.528 (0.070)

"MGB" refers to the (ideal) Meier et al. (2009) estimator,
"NV" to the number of selected variables,
"FP" to the false positive, "FN" to the false negative,
and "EMSE" refers to the empirical mean square error.
Standard deviations are given in parentheses.



Table 3: Simulation results for the MGB estimator in model 2 with
$t = 1$ and $\lambda_2 = 0.05$, and different values of the multiplier $\bar{\lambda}_1$ in $\lambda_1 = \bar{\lambda}_1 \times \lambda_{\max}/n$

        $\bar{\lambda}_1 = 0.12$   $\bar{\lambda}_1 = 0.08$   $\bar{\lambda}_1 = 0.04$   $\bar{\lambda}_1 = 0.02$
NV      5.14 (0.99)       7.82 (1.94)       16.77 (4.51)      83.00 (10.14)
FP      0.27 (0.55)       1.41 (1.59)        8.89 (4.48)      75.01 (10.14)
FN      3.14 (0.78)       1.60 (1.00)        0.12 (0.35)       0.01 (0.08)
EMSE    2.167 (0.146)     1.624 (0.145)      0.921 (0.125)     0.528 (0.070)

"NV" refers to the number of selected variables, "FP" to the false
positive, "FN" to the false negative, and "EMSE" refers to the empirical
mean square error. Standard deviations are given in parentheses.
(b) $u_i \mid z_i \sim N(0, \sigma_u(z_i)^2)$ and $\sigma_u(z_i) \le \sigma_u$ almost surely for some constant $\sigma_u$
independent of $n$.

(C3) (Restrictions on the distribution of $z_i$)

(i) The support of $z_i$ is $[0,1]^d$.

(ii) Let $q_j$ denote the density of $z_{ij}$ for each $1 \le j \le d$. Then, $q_j$ is bounded
away from zero on $[0,1]$ uniformly over $1 \le j \le d$, i.e., there exists a positive
constant $c_q$ such that $c_q \le q_j$ on $[0,1]$ for all $1 \le j \le d$.

(C4) (Restriction on smoothness of the additive components) $g_j^* \in \mathcal{G}$ for all $j \in T^*$,
where $\mathcal{G} = W_2^\nu([0,1])$ for some positive integer $\nu$.

(C5) (Preliminary restrictions on $d$ and $s^*$) $d \ge n$, $\log d / n^{2\nu/(2\nu+1)} \to 0$ and $1 \le s^* \le
n$.

Condition (C1) is a standard assumption. Condition (C2) needs an explanation.
It turns out that the key property for our rate results in Theorems 4.1 and 4.2 (and
indeed for those in Koltchinskii and Yuan (2010), Raskutti et al. (2010) and Suzuki et
al. (2011) as well) is the normal concentration property (around its mean and given
$z_1, \dots, z_n$) of a random variable of the form $\sup_{t \in \mathcal{T}} \sum_{i=1}^n u_i t_i$, where $\mathcal{T}$ is a bounded and
countable subset of $\mathbb{R}^n$ ($\mathcal{T}$ typically depends on $z_1, \dots, z_n$). In fact, condition (C2) is
a primitive sufficient condition that ensures this normal concentration property. See
Appendix C for more discussion on this condition. Condition (C3) is standard in
the series estimation literature (see e.g. Newey, 1997). Condition (C4) restricts the
smoothness property of each additive component $g_j^*$. We exclude the case that $\nu$ is
fractional. Condition (C5) is a preliminary restriction on the growth rate of $d$. To
make the technical argument simpler, we here assume that $d \ge n$. Because our primary
concern is the "$d \gg n$" case, this restriction does not bind. The second part of
condition (C5) restricts $d$ from growing too fast. The last part of condition (C5) is
a natural restriction on $s^*$.

4.2 A generic bound on the second step estimator 

In this section, we present a generic bound on the second step estimator. Although 
we primarily focus on to use the group Lasso as a first step procedure, the result of 
this section holds for any variable selection method satisfying the high level condition 
stated below. 
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We first prepare some notation. Let Qj := {gj G G : E[gj{zij)] = 0}. For a subset 
T G {1,. . .,d}, define 

a{T) := inf J a > : E bjWl < c^\\Y.9j\\l "^dj e (j G T) 

The quantity a{T)^^ is an analogue of sparse minimum eigenvalues to the infinite 
dictionary case. It is clear that when -Zy, j G T are independent, a{T) = 1, so Q;(T) 
measures the dependence among variables Zij,j G T (recall that each function in Qj 
is centered such that E[5fj(2;ij)] = 0). Such a quantity appears in other papers on 
estimation of high dimensional additive models (Koltchinskii and Yuan, 2010; Suzuki 
et al., 2011). Observe that a{T) > 1 for any non-empty T C {1, . . . , d}. 
We introduce a high level condition on $\hat T$. Put $\delta := \max\{ n^{-\nu/(2\nu+1)}, \sqrt{\log d / n} \}$.

(C6) (Restriction on the set $\hat T$) $n^{1/\{2(2\nu+1)\}} \delta \, \alpha(T^* \cup \hat T) |T^* \cup \hat T| = o_p(1)$.

Note that under condition (C5), $n^{1/\{2(2\nu+1)\}} \delta = \max\{ n^{-(2\nu-1)/\{2(2\nu+1)\}}, \sqrt{\log d / n^{2\nu/(2\nu+1)}} \} \to 0$.
Condition (C6) requires that $\alpha(T^* \cup \hat T)$ and $|T^* \cup \hat T|$ are not too large. In the
canonical case in which $\alpha(T^* \cup \hat T) \lesssim_p 1$ and $|\hat T| \lesssim_p s^*$, condition (C6) is satisfied if
$s^* = o[\min\{ n^{(2\nu-1)/\{2(2\nu+1)\}}, \sqrt{n^{2\nu/(2\nu+1)} / \log d} \}]$. We shall comment that, even when $T^*$
were known, a condition analogous to (C6) is needed to obtain a reasonable bound
on the estimator, so we believe that, as long as $|\hat T|$ is stochastically not overly large
compared with $s^*$, condition (C6) is a reasonable restriction. It will be shown that,
when the group Lasso is used as a first step procedure, $|\hat T| \lesssim_p s^*$.
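To get a feel for the magnitude of the quantity restricted by condition (C6), the following minimal sketch evaluates $\delta$ and the product $n^{1/\{2(2\nu+1)\}} \delta \, \alpha \, |T^* \cup \hat T|$ for user-supplied values. The function names and the idea of plugging in a fixed value for $\alpha(T^* \cup \hat T)$ are illustrative assumptions; in the paper both $\alpha$ and the selected set are random.

```python
import math


def delta(n: int, d: int, nu: int) -> float:
    """delta = max{ n^(-nu/(2nu+1)), sqrt(log d / n) }."""
    return max(n ** (-nu / (2 * nu + 1)), math.sqrt(math.log(d) / n))


def c6_scale(n: int, d: int, nu: int, alpha: float, size: int) -> float:
    """Left side of (C6): n^{1/(2(2nu+1))} * delta * alpha * |T* u T_hat|.

    alpha and size are treated as given numbers here (an assumption made
    purely for illustration).
    """
    return n ** (1 / (2 * (2 * nu + 1))) * delta(n, d, nu) * alpha * size
```

For a fixed dimension and a fixed union size, the returned value shrinks as $n$ grows, which is the behavior (C6) asks for.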

We are now in position to state the main theorem of this paper. 

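Theorem 4.1. Assume conditions (C1)-(C6). Take $\lambda_2$ such that $\lambda_2 \ge A_{2,u,\nu} n^{-\nu/(2\nu+1)}$,
where $A_{2,u,\nu}$ is some positive constant depending only on the distribution of $u_1$ and the
smoothness index $\nu$. Then, we have

$$\|\tilde g - g^*\|_2^2 + \lambda_2^2 \sum_{j \in \hat T} I(\tilde g_j)^2 \lesssim_p \max\Big\{ \alpha(T^* \cup \hat T) |T^* \cap \hat T| n^{-2\nu/(2\nu+1)}, \ \alpha(T^* \cup \hat T) |\hat T \setminus T^*| \delta^2, \ n^{-2\nu/(2\nu+1)} \|g^*\|_2^2,$$
$$\lambda_2^2 \sum_{j \in T^* \cap \hat T} I(g_j^*)^2, \ n^{-2\nu/(2\nu+1)} \sum_{j \in T^* \setminus \hat T} I(g_j^*)^2, \ \Big\| \sum_{j \in T^* \setminus \hat T} g_j^* \Big\|_2^2 \Big\}.$$

In particular, in the canonical case in which (i) $\alpha(T^* \cup \hat T) \lesssim_p 1$; (ii) $\|g^*\|_2^2 \lesssim s^*$; (iii)
$\sum_{j \in T^*} I(g_j^*)^2 \lesssim s^*$, for $\lambda_2 \ge A_{2,u,\nu} n^{-\nu/(2\nu+1)}$, we have

$$\|\tilde g - g^*\|_2^2 + \lambda_2^2 \sum_{j \in \hat T} I(\tilde g_j)^2 \lesssim_p \max\Big\{ s^* \lambda_2^2, \ |\hat T \setminus T^*| \delta^2, \ \Big\| \sum_{j \in T^* \setminus \hat T} g_j^* \Big\|_2^2 \Big\}. \quad (4.1)$$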
Remark 4.1. In principle, it is possible to state the theorem in a non-asymptotic
manner; however, to keep the exposition clear, we state the theorem as it is.

Interestingly, $\lambda_2$ can be taken independent of $d$ despite the random fluctuation of
$\hat T$. This is in contrast to the fact that, e.g. in Koltchinskii and Yuan (2010) and Raskutti
et al. (2010), penalty levels (on smoothness) should scale as $\log d$ as $d \to \infty$.

This theorem characterizes the effect of the first step variable selection in an
explicit manner: in (4.1), (i) the first term $s^* \lambda_2^2$ reflects the oracle rate; (ii) the second
term $|\hat T \setminus T^*| \delta^2$ reflects the cost of selecting redundant components; (iii) the third term
$\| \sum_{j \in T^* \setminus \hat T} g_j^* \|_2^2$ reflects the magnitude of missed components. We will investigate the
behaviors of these terms when the group Lasso is used as a first step procedure.
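The three terms in (4.1) can be tabulated for a given first step outcome, which makes it easy to see which one dominates. The sketch below is only bookkeeping under stated assumptions: the sets, the penalty level, $\delta$, and the squared norm of the missed components are supplied by the user, the function name is hypothetical, and the constants hidden in $\lesssim_p$ are ignored.

```python
def second_step_bound_terms(T_star, T_hat, lambda2, delta, missed_norm_sq):
    """Return the three terms of (4.1), up to constants.

    T_star: indices of the truly nonzero components (assumed known here).
    T_hat: indices selected in the first step.
    missed_norm_sq: squared L2 norm of the sum of missed components,
        supplied by the user (it is not computable from data alone).
    """
    s_star = len(set(T_star))
    n_redundant = len(set(T_hat) - set(T_star))
    return {
        "oracle": s_star * lambda2 ** 2,        # s* lambda_2^2
        "redundant": n_redundant * delta ** 2,  # |T_hat \ T*| delta^2
        "missed": missed_norm_sq,               # || sum of missed g_j* ||^2
    }
```

For instance, with three true components, two redundant selections, $\lambda_2 = 0.1$ and $\delta = 0.2$, the redundant-selection term is the largest unless the missed components are sizeable.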

4.3 Properties of the group Lasso 

In this section, we collect the statistical properties (namely the convergence rate and
the model selection property) of the group Lasso estimator $\hat g$ used as a first step
estimator. Although such properties have been well studied in the literature, especially
for the parametric regression case (Nardi and Rinaldo, 2008; Bach, 2008; Ravikumar et
al., 2009; Huang and Zhang, 2010; Wei and Huang, 2010; Huang et al., 2010; Lounici
et al., 2011; Negahban et al., 2010), we could not find results that we exactly need
in the present setting, in particular an explicit scaling condition on the triple
(d, s*, m) that guarantees the statistical properties. For the sake of completeness, we
state these properties here. Their proofs are found in Appendix B.
We begin by introducing restrictions on basis functions.

(C7) (Restrictions on basis functions used in the first step estimation)

(a) $\sup_{z \in [0,1]} \| (\psi_1(z), \dots, \psi_m(z))' \|_E = O(m^{1/2})$.

(b) $E[x_{iG_j} x_{iG_j}'] = I_m$ for all $1 \le j \le d$.

(c) infgegT* \\g*-g\\l < s*m-^', where := {g : Z ^ W : g{z) = Y^j^T* ^j(^j) = 
(zi,...,^d)'), gj&Qmij^T*)}- 

We refer to Newey (1997) for some basic materials on series estimation. Condition
(C7)-(a) is satisfied for splines and Fourier bases. Condition (C7)-(b) is a normalization
condition, and does not lose any generality as long as we are concerned with the
analysis of the statistical properties of the group Lasso estimator. Condition (C7)-(c)
corresponds to condition (C4) and is thought to be a reasonable restriction. Consider,
for instance, the case in which $\psi_1, \dots, \psi_m$ are spline functions of degree $(\nu + 1)$ on $[0, 1]$ with equidistant
knots. By Corollary 6.26 of Schumaker (2007), there exists a $g^m = \sum_{j \in T^*} g_j^m$ in the approximating class of condition (C7)-(c)
such that $\sum_{j \in T^*} \|g_j^* - g_j^m\|_2^2 \lesssim m^{-2\nu} \sum_{j \in T^*} I(g_j^*)^2$. Because $E[g_j^*(z_{1j})] = 0$, $g_j^m$ may be
taken such that $E[g_j^m(z_{1j})] = 0$. Therefore, letting for a subset $T \subset \{1, \dots, d\}$,

$$\beta(T) := \inf\Big\{ \beta > 0 : \Big\| \sum_{j \in T} g_j \Big\|_2^2 \le \beta \sum_{j \in T} \|g_j\|_2^2, \ \forall g_j \in \mathcal{G}_j \ (j \in T) \Big\},$$

if $\sum_{j \in T^*} I(g_j^*)^2 \lesssim s^*$ and $\beta(T^*) \lesssim 1$, then $\|g^* - g^m\|_2^2 \le \beta(T^*) \sum_{j \in T^*} \|g_j^* - g_j^m\|_2^2 \lesssim
\beta(T^*) m^{-2\nu} \sum_{j \in T^*} I(g_j^*)^2 \lesssim s^* m^{-2\nu}$. The restriction that $\sum_{j \in T^*} I(g_j^*)^2 \lesssim s^*$ is reasonable.
Trivial examples in which $\beta(T^*) \lesssim 1$ are the case that $s^* \lesssim 1$ or the case that
$z_{1j}, j \in T^*$ are independent. Conditions similar to $\beta(T^*) \lesssim 1$ appear in other papers
such as Koltchinskii and Yuan (2010).

We now start to investigate the statistical properties of the group Lasso estimator.
To this end, we prepare some notation. Define the event

$$\Omega_0 := \{ \|\hat\Sigma_j - I_m\| \le 0.5, \ 1 \le \forall j \le d \}.$$

We will later give a sufficient condition under which $P(\Omega_0) \to 1$, which means that,
with probability approaching one, all $\hat\Sigma_j$ are "well behaved" in the sense that they do
not deviate too much from their population values.
Define the set 

C:={aeM''"*: J] ||aG, |U < 21 ||aG, |U}. 

The set $\mathbb{C}$ is a cone, i.e., for any $\alpha \in \mathbb{C}$ and $c > 0$, $c\alpha \in \mathbb{C}$. It consists of vectors
$\alpha \in \mathbb{R}^{dm}$ such that the coordinates of $\alpha$ in the set $T^*$ are dominant. Such cones of
dominant coordinates play an important role in the analysis of penalization methods
for high dimensional statistical models. Define the $\mathbb{C}$-restricted eigenvalue of $\hat\Sigma^{1/2}$ by

$$\hat\kappa := \min_{\alpha \in \mathbb{S}^{dm-1} \cap \mathbb{C}} \|\hat\Sigma^{1/2} \alpha\|_E.$$

Restricted eigenvalues were originally introduced by Bickel et al. (2009) for the Lasso
formulation. While the minimum eigenvalue of $\hat\Sigma$ is always zero when $dm > n$, $\hat\kappa$ can
be positive with a high probability as long as the corresponding restricted eigenvalue
of the population matrix $\Sigma$ is bounded away from zero (see Lemma B.4 in Appendix
B).
Put $\tilde x_{iG_j} := \hat\Sigma_j^{-1/2} x_{iG_j}$, where $\hat\Sigma_j^{-1/2}$ is interpreted as the generalized inverse of
$\hat\Sigma_j^{1/2}$ if it is singular. If $\hat\Sigma_j = U D U'$ denotes the spectral decomposition of $\hat\Sigma_j$,
where $U$ is an $m \times m$ orthogonal matrix and $D$ is an $m \times m$ diagonal matrix with
diagonal entries $d_1 \ge \cdots \ge d_l > 0 = d_{l+1} = \cdots = d_m$, then $\hat\Sigma_j^{-1/2}$ is given by
$\hat\Sigma_j^{-1/2} = U \, \mathrm{diag}\{ d_1^{-1/2}, \dots, d_l^{-1/2}, 0, \dots, 0 \} U'$. Invoke that on the event $\Omega_0$, all $\hat\Sigma_j$ are
nonsingular. Define the random variable



A := max 

l<3<d 



n 



i=l 



This random variable plays the role of a "threshold" value for $\lambda_1$.
We state a preliminary bound on $\hat g$ in terms of $\| \cdot \|_{2,n}$.
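The generalized inverse square root used above to normalize the regressors can be computed directly from the spectral decomposition. The following is a minimal sketch, assuming NumPy is available; the function name and the numerical tolerance used to decide which eigenvalues count as zero are our own choices and do not come from the paper.

```python
import numpy as np


def generalized_inv_sqrt(sigma, tol=1e-10):
    """Generalized inverse square root of a symmetric PSD matrix.

    Uses sigma = U D U' and returns U diag(d_k^{-1/2} or 0) U', where
    eigenvalues not exceeding `tol` (an assumed threshold) are treated as zero.
    """
    sigma = np.asarray(sigma, dtype=float)
    eigvals, U = np.linalg.eigh(sigma)          # spectral decomposition
    inv_sqrt = np.zeros_like(eigvals)
    positive = eigvals > tol
    inv_sqrt[positive] = 1.0 / np.sqrt(eigvals[positive])
    return (U * inv_sqrt) @ U.T                 # scale columns of U, then multiply by U'
```

When the input is nonsingular, as is the case for every $\hat\Sigma_j$ on the event $\Omega_0$, the result is the ordinary inverse square root, so the normalized regressors have an identity empirical second moment matrix.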

Proposition 4.1. On the event {Ai > 2A} r\{k > 0} (1 Qq, we have 

II * -||2 ^ o • £ II * Il2 , r~< ^ TnX^ 

\\9 2,n<2 mi \\g - 5 2,n + -r^-^ , 

whereC2 is a universal constant andQ'^* :={(?: Z —> M : g[z) — YlijeT* Yl^=i (^jkii^ki^j)' 
i^jk) {z = (-21, . . . , Zd)'), I3jk eR{j eT*;l<k< m)}. 

To state the model selection property of the group Lasso estimator, we need another
concept, namely, group sparse eigenvalues. For any subset $T \subset \{1, \dots, d\}$, let $\mathbb{S}_T^{dm-1} :=
\{ \alpha \in \mathbb{R}^{dm} : \alpha_{G_{T^c}} = 0 \} \cap \mathbb{S}^{dm-1}$. Define the $s$-th group sparse maximum eigenvalue of
$\hat\Sigma^{1/2}$ by

$$\hat\phi_{\max}(s) := \max_{|T| \le s, \ \alpha \in \mathbb{S}_T^{dm-1}} \|\hat\Sigma^{1/2} \alpha\|_E.$$

The next proposition gives a preliminary bound on $\hat s$, the number of components
selected by the group Lasso estimator $\hat g$: $\hat s := |\hat T^0| = |\{ j \in \{1, \dots, d\} : \|\hat g_j\|_{2,n} \ne 0 \}|$.



Proposition 4.2. Let C := 'inWg* — g\\2,n/ {s/ s*m\i) and 5 := {s e {1, . . . , c?} : s > 
2C'^0inax(s)^s*}- On the event {Ai > 2A V 0} n VLq, we have 

s < C'^[min0max(s)^]s*. 

s&S 

Propositions 4.1 and 4.2 are deterministic statements, and they do not use any
stochastic argument. In order to bound the stochastic orders of $\|g^* - \hat g\|_{2,n}$ and $\hat s$, we
have to determine: (i) conditions that ensure $P(\Omega_0) \to 1$; (ii) a value of $\lambda_1$ such
that $\lambda_1 \ge 2\Lambda$ with probability approaching one; (iii) conditions that ensure the desired
asymptotic behaviors of $\hat\kappa$ and $\hat\phi_{\max}(s)$; (iv) a stochastic order of the approximation
error appearing in Proposition 4.1. Lemmas B.1-B.5 in Appendix B are concerned with these
issues. We shall comment that while the proofs of Propositions 4.1 and 4.2 are a direct
adaptation of the corresponding proofs in the Lasso case, the proofs of Lemmas B.1-B.4
are not, because the fact that the size ($m$) of each group goes to infinity
brings a subtle technical issue. Given Propositions 4.1 and 4.2, and Lemmas B.1-B.5
in Appendix B, we obtain the following theorem.

Theorem 4.2. Assume conditions (C1)-(C5) and (C7). Assume further that $s^*, m, d$
andn obey the growth condition (s*)^mlog(dVn)/n — >■ 0, 0max(s) := maX|y|^^ aes'*"'"^ II^'^'^^ckIIe 
1 for some sequence s — Sn such that s/s* — >■ oo and k :— min„g§dm-inc ||5]^/^q;||e > 1, 
and \\g*\\2 ^ s*. Take Ai > Al^u^/n{l + ^/\ogdJm) with constant Ai^u given in Lemma 
B.2 in Appendix B and m > Then: 

II * -ii2 ^ s*mX\ ^ ^ 

\\9 - 9\\2,n S —Z^^ S<pS . 

I b 

In particular, if m >^ n}/i'^'^+^) and 

A,xmax|vS.y!i^|. (4.2) 
then we have \\g* — <j, s*S^. 

Proof. See Appendix B. □ 

Remark 4.2. When $m \asymp n^{1/(2\nu+1)}$, the order of $d$ allowed is $\log d = o\{ n^{2\nu/(2\nu+1)} / (s^*)^2 \}$.
If $\log d \asymp n^a$ and $s^* \asymp n^b$ for some $a, b > 0$, the region of admissible $(a, b)$ is
$\{ (a, b) : a, b > 0, \ a + 2b < 2\nu/(2\nu + 1) \}$. It is interesting to note that this region is
large when $\nu$ is large, i.e., when the additive components are smoother. This indicates
that the smoother the additive components are, the larger $d$ and $s^*$ can be.
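The admissible region in Remark 4.2 is simple enough to check mechanically. The sketch below, with a hypothetical function name, tests whether a pair of growth exponents $(a, b)$ satisfies $a + 2b < 2\nu/(2\nu+1)$; it encodes only the region stated in the remark and says nothing about the constants involved.

```python
def growth_exponents_admissible(a: float, b: float, nu: int) -> bool:
    """Check whether log d ~ n^a and s* ~ n^b lie in the region of Remark 4.2."""
    if a <= 0 or b <= 0:
        return False
    return a + 2 * b < 2 * nu / (2 * nu + 1)
```

For example, with $\nu = 2$ the boundary is $a + 2b = 0.8$, so $(a, b) = (0.5, 0.2)$ is excluded, whereas the same pair becomes admissible for $\nu = 10$, in line with the observation that smoother components permit larger $d$ and $s^*$.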

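We consider the magnitude of missed components $\| \sum_{j \in T^* \setminus \hat T^0} g_j^* \|_2^2$. For
a subset $T \subset \{1, \dots, d\}$, define the $T$-sparse minimal eigenvalue $\phi_{\min}(T)$ of $\Sigma$ by

$$\phi_{\min}(T) := \min_{\alpha \in \mathbb{S}_T^{dm-1}} \|\Sigma^{1/2} \alpha\|_E.$$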

We also need a slightly stronger approximation property than condition (C7)-(c). 
(C7) (c)' There exists a, g'^ — ^^^x* dT ^ ^^^h that max^cT* \\^iPTi9j ~ 



< 



gj')\\l<s*m-'r 



Corollary 4.1 (Magnitude of missed components). Assume the same conditions as
in Theorem 4.2 with condition (C7)-(c) replaced by (C7)-(c)'. Assume further that
$\phi_{\min}(T^* \cup \hat T^0) \gtrsim_p 1$. Then, we have $\| \sum_{j \in T^* \setminus \hat T^0} g_j^* \|_2^2 \lesssim_p s^* m \lambda_1^2 / n^2$.

This corollary clarifies sufficient conditions under which the magnitude of missed
components is not larger than the bound on $\|\hat g - g^*\|_{2,n}^2$. When $m \asymp n^{1/(2\nu+1)}$ and $\lambda_1$ is as in
(4.2), then, under the conditions of Corollary 4.1, $|\hat T^0 \setminus T^*| \lesssim_p s^*$ and $\| \sum_{j \in T^* \setminus \hat T^0} g_j^* \|_2^2 \lesssim_p
s^* \delta^2$. In that case, the second step estimator $\tilde g$ with $\hat T = \hat T^0$ and $A_{2,u,\nu} n^{-\nu/(2\nu+1)} \le
\lambda_2 \lesssim \delta$ satisfies $\|\tilde g - g^*\|_2^2 + \lambda_2^2 \sum_{j \in \hat T} I(\tilde g_j)^2 \lesssim_p s^* \delta^2$. This bound holds in general
cases in which $\hat T^0$ may fail to recover $T^*$. If it happens that $\hat T^0 = T^*$ with probability
approaching one, the estimator $\tilde g$ (with $\lambda_2 \asymp n^{-\nu/(2\nu+1)}$) enjoys the exact oracle
rate $s^* n^{-2\nu/(2\nu+1)}$. As long as $\lambda_2 \asymp n^{-\nu/(2\nu+1)}$ is taken, the estimator $\tilde g$ adapts to both
situations.

Sufficient conditions for perfect model selection are found in, e.g., Theorem 2 of
Ravikumar et al. (2009). Unfortunately, their condition (39) does not cover our choice
of the penalty level $\lambda_1$. Note that the correspondence between their notation (left)
and our notation (right) is: $p = d$, $d_n = m$ and $\lambda_n = \sqrt{m} \lambda_1 / n$. However, a careful
inspection of their proof shows that their condition (39) can be replaced by the weaker
condition that there exists some constant $C > 0$ such that

$$\frac{n \lambda_n^2}{d_n \vee \log p} \ge C \ \text{(in their notation)}, \quad \text{or} \quad \frac{m \lambda_1^2}{n (m \vee \log d)} \ge C \ \text{(in our notation)}, \quad (4.3)$$

which covers our choice of the penalty level $\lambda_1$. To see this, observe that their condition
(39) is used only to ensure (85) in their appendix, which can be replaced by (in their
notation) $P(\max_{j \in S^c} \|\hat g_j - \mu_j\| > \delta/2) \to 0$, or equivalently $P(\max_{j \in S^c} \|Z_j\| > \lambda_n \delta/2) \to
0$. By using first the union bound and then Theorem 7.1 of Ledoux (2001) (the Gaussian
concentration inequality) similarly to the proof of our Lemma B.2 in Appendix
B, it is shown that condition (4.3) is sufficient for $P(\max_{j \in S^c} \|Z_j\| > \lambda_n \delta/2) \to 0$.

4.4 Comparison with other work 

In this section, we briefly state connections and differences between the proposed method
and some existing estimation methods for high dimensional additive models. It must
be said that the literature on high dimensional additive models is now growing, so it
is beyond the scope of this paper to review all the existing methods in detail.

In Meier et al. (2009), the penalized least squares estimator defined by the solution 
to the following minimization problem is proposed: 

d d 



mm 

ceK,c/je£;,l<j<<i 



i=l i=l j=l ^ J . 



where the term $\|g_j\|_{2,n}$ controls sparsity while the term $I(g_j)$ controls smoothness of $g_j$.
Koltchinskii and Yuan (2010) and Raskutti et al. (2010) considered a doubly penalized
estimation method similar to Meier et al. (2009) but in a (more general) reproducing
kernel Hilbert space (RKHS) formulation. Suzuki et al. (2011) further analyzed the
Meier et al. (2009) method and established a faster convergence rate than Meier et al.
(2009) did in a more general setting. The method proposed in this paper can be viewed
as a method that splits such a "double penalization" into two steps, and is intended to
remove the shrinkage bias caused by simultaneously penalizing sparsity and smoothness.

Huang et al. (2010) proposed a two-step estimation method different from ours. 
Their proposal is to construct consistent estimators of the additive components at the 
first step, and then to use these consistent estimators to apply the adaptive group Lasso, 
which is a modification of the adaptive Lasso (Zou, 2006) to the group Lasso case. In 
particular, they proposed to use the group Lasso for the first step estimation. To be 
precise, under the notation of Section 2.2, let denote the solution to the group Lasso 
problem (2.3) with S^- replaced by Irm and use this group Lasso estimator to construct 
the weights: Wj :— 1/||/3g^.||e (we agree that 1/0 = oo). The adaptive group Lasso 
estimator is then defined by g^{z) = E?=i^/(^j)> ^/(^j) = 127=1 ^fki'^kizj) - ij^jk), 
where 

n d ' 

— J2{y^ - + >^Aj2^j\\f3G,\\E . 

The adaptive group Lasso can be seen as a post model selection estimator. In fact,
since $w_j = \infty$ when $\|\hat\beta_{G_j}\|_E = 0$, the adaptive group Lasso problem reduces to

i=i j^t j^t 

where $\hat T := \{ j \in \{1, \dots, d\} : \|\hat\beta_{G_j}\|_E \ne 0 \}$. Therefore, their estimation method is
similar to ours in some respects. Besides the similarity, however, there is a notable
difference between the two methods. Huang et al. (2010) intend to select correctly the
nonzero additive components with probability approaching one by using the group
Lasso penalty at both the first and second steps, while our method intends to ensure
sparsity and smoothness of the estimator. A point to be noticed is that the analysis
of Huang et al. (2010) substantially depends on the assumption that the non-zero
additive components are well separated from zero in the $L_2$-sense, which is, as argued
in the Introduction, significantly restrictive from a theoretical point of view, and it is because of
this assumption that the adaptive group Lasso estimator can achieve the exact oracle rate
in their analysis. Therefore, in a strict theoretical sense, their theoretical result is
not directly comparable to ours (and that of Meier et al. (2009)).
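The way infinite weights turn the adaptive group Lasso into a post model selection estimator can be made concrete with a few lines of code. The sketch below computes the weights $w_j = 1/\|\hat\beta_{G_j}\|_E$ with the convention $1/0 = \infty$ and reads off the set of components that remain free in the second stage. The data layout (a dictionary mapping each component index to its first step coefficient vector) and the function names are assumptions made for illustration.

```python
import math


def adaptive_weights(beta_groups):
    """Map each group j to w_j = 1 / ||beta_{G_j}||_E, with 1/0 read as infinity.

    beta_groups: dict {j: list of first step coefficients for group j}
    (a hypothetical layout, not the paper's notation).
    """
    weights = {}
    for j, coefs in beta_groups.items():
        norm = math.sqrt(sum(b * b for b in coefs))
        weights[j] = math.inf if norm == 0.0 else 1.0 / norm
    return weights


def surviving_components(weights):
    """Components with finite weight, i.e. those not forced to zero."""
    return {j for j, w in weights.items() if math.isfinite(w)}
```

Any component whose first step coefficient block is exactly zero receives an infinite weight and is therefore excluded from the second stage, which is the reduction described in the text.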



$^ :— arg min 



min 
5 Conclusion



In this paper we have investigated the two-step estimation of high dimensional additive
models. In particular, we have derived a generic performance bound on the second step
estimator, and studied the overall performance when the group Lasso is used as a
first step variable selection. Dividing the overall estimation procedure into two steps
helps to remove the shrinkage bias caused by the double penalization strategy, and we
believe that the theoretical and numerical properties explored in this paper offer useful
guidance for practical applications.



6 Proof of Theorem 4.1 

The proof of Theorem 4.1 uses the next technical lemma. Its proof is based on
empirical process techniques. Define $\mu_{g_j} := E[g_j(z_{1j})]$.

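Lemma 6.1. Assume conditions (C1)-(C5). Then, there exist a positive constant $C_{u,\nu}$
depending only on the distribution of $u_1$ and the smoothness index $\nu$, and a positive
constant $c_{q,\nu}$ depending only on $c_q$ (given in condition (C3)) and $\nu$ such that the
following holds: for any sequence of nonempty subsets $T = T_n \subset \{1, \dots, d\}$ and for any
sequence of constants $\epsilon = \epsilon_n \to 0$ such that $\epsilon \ge C_{u,\nu} n^{-\nu/(2\nu+1)}$, we have, with probability
approaching one:

(i) $\Big| \frac{1}{n} \sum_{i=1}^n u_i g_j(z_{ij}) \Big| \le \max\{ \epsilon, C_1 \sqrt{\log(s \vee n)/n} \} \sqrt{\|g_j\|_2^2 + \epsilon^2 I(g_j)^2}$, $\forall g_j \in \mathcal{G}$, $\forall j \in T$;

(ii) $\Big| \frac{1}{n} \sum_{i=1}^n g_j(z_{ij}) - \mu_{g_j} \Big| \le \max\{ \epsilon, C_1 \sqrt{\log(s \vee n)/n} \} \sqrt{\|g_j\|_2^2 + \epsilon^2 I(g_j)^2}$, $\forall g_j \in \mathcal{G}$, $\forall j \in T$;

(iii) $\big| \|g\|_{2,n}^2 - \|g\|_2^2 \big| \le c_{q,\nu} \epsilon^{-1/(2\nu)} \max\{ \epsilon, \delta \} \Big\{ \sum_{j=1}^d \sqrt{\|g_j\|_2^2 + \epsilon^2 I(g_j)^2} \Big\}^2$, $\forall g = \sum_{j=1}^d g_j$, $g_j \in \mathcal{G}$;

where $s := s_n := |T|$ and $C_1 > 0$ is a universal constant.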



Proof of Lemma 6.1. See Appendix A. □



Proof of Theorem 4.1. We first point out that because of the restriction $\sum_{i=1}^n g_j(z_{ij}) =
0$, by a standard argument, we may assume that $\tilde c = c^* = E[y_i] = 0$ for the analysis of
$\tilde g$.

Let $C_{u,\nu}$, $c_{q,\nu}$ and $C_1$ be the constants given in Lemma 6.1. Take $\epsilon = \epsilon_n =
C_{u,\nu} n^{-\nu/(2\nu+1)}$ and $\lambda_2 \ge \sqrt{2} \epsilon$. Define the events

$\Omega_1$ := event (i) of Lemma 6.1 with $T = T^*$,
$\Omega_2$ := event (i) of Lemma 6.1 with $T = (T^*)^c$,
$\Omega_3$ := event (ii) of Lemma 6.1 with $T = T^*$,
$\Omega_4$ := event (ii) of Lemma 6.1 with $T = (T^*)^c$,
$\Omega_5$ := event (iii) of Lemma 6.1.





In what follows, we go through the proof on the event $\cap_{k=1}^5 \Omega_k$. Note that the
probability of this event goes to one. Because $s^* = |T^*| \le n$, we may assume that
$C_1 \sqrt{\log(s^* \vee n)/n} = C_1 \sqrt{\log n / n} \le \epsilon$ in events (i) and (ii) of Lemma 6.1 with $T = T^*$.
Let $\varrho := \varrho_n := \max\{ \epsilon, C_1 \sqrt{\log d / n} \}$. Invoke that $\varrho \asymp \delta$. We may assume that $\varrho \le 1$.
For $j \notin \hat T$, we agree that $\tilde g_j = 0$.
Because of the optimality of $\tilde g$,



2n 

1=1 j^T 

Then, using the relation 



i=l 



3&T 



$$(y_i - g(z_i))^2 = u_i^2 + 2 u_i (g^*(z_i) - g(z_i)) + (g^*(z_i) - g(z_i))^2,$$



we have (note that $\tilde g_j = 0$ for $j \notin \hat T$)

^ll/-^ll2,n + A^E^fe)' 



J6T 



1 " 1 

- E '^iiSjiZij) - 9*j{Zij)} + A2 E ^idjf + IIEjeT*\rfj*ll2,n- 

. i=i J jet 



Using the facts that $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$, $ab \le 0.5(a^2 + b^2)$ and $(a + b)^2 \le 2(a^2 + b^2)$,
"1 " 



E 

jeT*nf 



i=l 



jeT*nf 

<e E \\9j-9j\\2 + e' E ^fe " 



<eJ\T*nf\ J2 \\~9j-9*\\l + 0.5e'\T*nf\+0.5e'J2^(~9j-9;f 

jeT*nf 



T*nT 



<eJ\T*nf\ J2 ||^,-5;||i + o.56^|T*nf| + 6^ Yl iihf + ^' E ^^9]f. 

jeT*nf jeT*nf jeT*nf 
For any fixed $b > 0$,



e l\T*nf\ J2 
y jeT*nf 



*l|2 



9j\\2 



2be^\T*nf\x- J2 



jeT*nT 



9j\\2 



Similarly, we have 

E 



1 " 

i=l 

1 " . 



= E 



iet\T* 



jeT\T* 



jeT\T* 



Thus, we have 



jet jet 

+(A^+62) E %;)'+^iiE,.Tnf^;ii2,n- (6.1) 



Recall the definition of $\alpha(T)$. Invoke now that



Eii^^-^*ii2< 



jet 



'j\\2^ E 
jeT*uf 



9j\\2 



<2 E llfe-/^*)-^lll2 + 2E'"l K:=%iK)]) 
jeT*uf j&f 

<2a(T-uf)||e-^s)-9lli + 2E''i 



4 



<2a(T*uf)||^-^*||^ + 2E4- 
Because of the restriction $\sum_{i=1}^n \tilde g_j(z_{ij}) = 0$, $\mu_{\tilde g_j} = -n^{-1} \sum_{i=1}^n \{ \tilde g_j(z_{ij}) - \mu_{\tilde g_j} \}$, so that,
because of the event $\Omega_3$, for all $j \in T^* \cap \hat T$,



<'2eT9j-9*\\l + 2e'\\g*\\l + e'l{~gjf, 



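while because of the event $\Omega_4$, for all $j \in \hat T \setminus T^*$, $\mu_{\tilde g_j}^2 \le \varrho^2 \|\tilde g_j\|_2^2 + \varrho^2 \epsilon^2 I(\tilde g_j)^2$. Thus,

noting that $\varrho \ge \epsilon$,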

(l-max{46^2^2})5^||^-^;||^<2a(^*U^)||^-^*||^+462 J2 ||^;||^+2^V ^ 
jet jeT*nf jet 

so that for $n$ large enough (such that $\max\{ 4\epsilon^2, 2\varrho^2 \} \le 0.5$),

Y.\\~9j- 9*111 <MT*uf)rg-g*\\l + Se' II^HI^ + JfeO^. (6.2) 

jet jeT*nf jef 

Substituting (6.2) into (6.1), we obtain 

l\\9*-m,n+U-e'-'^)j:'i9^)' 



< 11^ - ^11^ + {b + 0.5){e'\T* n T| + g^|f \r*|) + ^ Yl W 



jeT*nT 



2 

2,n- 



(6.3) 



j€T*r\T 



We next consider a lower bound on $\|g^* - \tilde g\|_{2,n}^2$. Observe that





Win 


>ll/ 


-~9\\l 


>ll/ 


-m 


>ll/ 


-~9\\l 



-1 2 



E \/\\9;-~9M + ^'H9*-9j)' 



-1/(21.) 



jeT*UT 

max{e,5}|r*ur|{a(r*uf')||/-^||^ + e^ J2 1(9* - ~9jf 

jeT*uf 



>(l-ci)|b 



* ~9\\l 



2c,e^Y.^i9*r-^c,e^Y.^{~g,r, 



where $c_1 := c_{q,\nu} \epsilon^{-1/(2\nu)} \max\{ \epsilon, \delta \} |T^* \cup \hat T| \alpha(T^* \cup \hat T)$ and $c_2 := c_{q,\nu} \epsilon^{-1/(2\nu)} \max\{ \epsilon, \delta \} |T^* \cup
\hat T|$. Substituting this inequality into (6.3), we have

1 ci a(T*uf) 



2 2 



2e 



<{b + 0.5){e'\T*nf\ + g'\f\T*\) + ^ J2 \\9*\\l + i>^l + + c,e') 



jGT*nT 



jeT*nT 



j€T*\T 



We wish to bound the term $\| \sum_{j \in T^* \setminus \hat T} g_j^* \|_{2,n}^2$. Observe that
\\'^jeT*\f9j II2 ,n 

< \\Y.j^T*\f9j\\l + c,,.e-'/^''^^ max{e, 5} 



jeT*\f 

<iiE,eTnf^?;ii2+c.,.e-v(2'^vax{6,5}|rvi E mwi+e'm"} 

jeT*\f 

< IIE,eT*\f^;il2 + c,,.6-V(2'^) max{6, 5}|T*\r| | a(r*\r)||E,eT*\f^;il^ + 6^ E ^(^D' 

< (1 + C3)||E,eT*\f ^;il2 + C4e^ E ^(9*)^ 

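where $c_3 := c_{q,\nu} \epsilon^{-1/(2\nu)} \max\{ \epsilon, \delta \} |T^* \setminus \hat T| \alpha(T^* \setminus \hat T)$ and $c_4 := c_{q,\nu} \epsilon^{-1/(2\nu)} \max\{ \epsilon, \delta \} |T^* \setminus \hat T|$.
Therefore, we have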



2 2 



1 ci Q;(r* U T) ^ ~||2 , / x2 2 re 2 \ r/- ^2 

II5 -^ll2+(^2-e --^--C2e )2^/(5j) 



2e' 



<(b + 0.5)(e'\T*nf\ + g'\f\T*\) + ^ E Iblll' + (A^ + e' + £26^) E ^(^D' 



jeT*nT 



jeT*\T 



Taking $b = 4\alpha(T^* \cup \hat T) \ge 4$ and noting that $\varrho \le 1$, we have



i-|)lls--slll + ]A^-(l + <iO''[E^(*) 



< {4a(T*uf) + 0.5}(e2|T*nf| + ^)2|f\T*|) + ^||/||2 + (A2 + e2 + C2e2) ^ %*) 



+ 



(1 + 23) 



iiE,eT.\f^;ii^+(c2 + ^)6^ E ^(0, 

jGT*\f 
where we have used the inequality

jeT*r\f jeT* 

Because $c_1 = o_p(1)$, $c_2 = o_p(1)$, $c_3 = o_p(1)$ and $c_4 = o_p(1)$ by condition (C6), we have
$c_1 \le 1/4$, $c_2 \le 1/2$, $c_3 \le 1$ and $c_4 \le 1$ with probability approaching one. Define the
event

$$\Omega_6 := \{ c_1 \le 1/4, \ c_2 \le 1/2, \ c_3 \le 1, \ c_4 \le 1 \}.$$

Recall that $\lambda_2^2 \ge 2\epsilon^2$. Therefore, on the event $\cap_{k=1}^6 \Omega_k$, we have

^11/ - 9\\l + X E ^(9jf < {MT* U T) + 0.5}{e^\T* nf\ + g''\f\T*\) 
jet 

+ ^ibiii + (A^+i-5e') E %;)' + iiE,.T.\f^;ii^+^' E ^(^i)'- 

jgT*nf jeT*\f 
The desired conclusion follows from the fact that $P(\cap_{k=1}^6 \Omega_k) \to 1$. □
A Proof of Lemma 6.1

In the proofs below, we agree that C denotes a universal constant, and its value may 
change from line to line. The same rule applies to Appendix B. 
A.1 Preliminary lemmas



In this section, we collect some preliminary results used in the proof of Lemma 6.1. 
We begin with introducing an interpolation inequality by Gabushin (1967). 

Lemma A.l (Gabushin (1967)). For any f G 1^2^([0, 1]) with positive integer v, 




{2y-\)l{2v)u f(y)y\l{2v) 
L2(\) \\J IIL2(A) ' 



L2(A), 

where K is a constant independent of f, and 
to the Lebesgue measure A on [0, 1] . 



^/ll/^'^^lk(A)=0, 



|l2(a) denotes the L2-norm with respect 



We also use the next lemma. For any probability measure $Q$ on $[0, 1]$, define the
$\nu \times \nu$ matrix

$$\Sigma_{Q,\nu} := E_{Z \sim Q}\big[ (1, Z, \dots, Z^{\nu-1})' (1, Z, \dots, Z^{\nu-1}) \big].$$

Lemma A.2. Let $Q$ be any probability measure on $[0, 1]$ such that the matrix $\Sigma_{Q,\nu}$ is
non-singular. For any $f \in W_2^\nu([0, 1])$ ($\nu$ a positive integer), there exist functions $f^{[1]}$
and $f^{[2]}$ such that (i) $f = f^{[1]} + f^{[2]}$; (ii) $\|f^{[1]}\|_\infty \le \mathrm{const.} \times I(f^{[1]})$ (the constant is
independent of $f$); (iii) $f^{[2]}$ is a polynomial function on $[0, 1]$ of degree $\nu - 1$; and (iv)
$\int f^{[1]} f^{[2]} dQ = 0$.

Proof (sketch). Lemma A.2 is used in Meier et al. (2009) but without proof. For the
sake of completeness, we provide a sketch of the proof. Take any $f \in W_2^\nu([0, 1])$.
It is standard to see that there exist functions $f^{[1]}$ and $f^{[2]}$ such that $f = f^{[1]} +
f^{[2]}$, $\|f^{[1]}\|_\infty \le I(f^{[1]})$ and $f^{[2]}$ is a polynomial function on $[0, 1]$ of degree $\nu - 1$ (use
Taylor's theorem). Let $f^{[3]}$ denote the orthogonal projection (in $L_2(Q)$) of $f^{[1]}$ onto
the space of all polynomial functions of degree $\nu - 1$. Then, using the fact that $\Sigma_{Q,\nu}$
is non-singular, by simple algebra, each coefficient of $z^k$ ($k = 0, \dots, \nu - 1$) in $f^{[3]}$ is
bounded by $K \|f^{[1]}\|_\infty \le K I(f^{[1]})$, so that $\|f^{[3]}\|_\infty \le K' I(f^{[1]})$ ($K$ and $K'$ are constants
independent of $f$). Replacing $f^{[1]}$ by $f^{[1]} - f^{[3]}$ and $f^{[2]}$ by $f^{[2]} + f^{[3]}$, we obtain the
desired conclusion. □

It is standard to see that $\Sigma_{Q,\nu}$ is non-singular if the density of $Q$ is bounded away
from zero on $[0, 1]$. The next lemma is due to Corollary 5 of Meier et al. (2009),
which is basically deduced from an entropy integral argument and a peeling argument
(such techniques are described in Chapter 8 of van de Geer (2000)). Recall that



HI) 



2 ._ 



Lemma A.3 (Essentially Meier et al. (2009), Corollary 5). Let $\xi_1, \dots, \xi_n$ be i.i.d.
from a distribution $Q$ on $[0, 1]$ such that $\Sigma_{Q,\nu}$ is non-singular, and let $\sigma_1, \dots, \sigma_n$ be
independent Rademacher random variables independent of $\xi_1, \dots, \xi_n$. Let $E_\sigma$ denote
the conditional expectation with respect to $\sigma_1, \dots, \sigma_n$ given $\xi_1, \dots, \xi_n$. Then, there exists
a positive constant $C_\nu$ depending only on $\nu$ such that for all $\epsilon \ge C_\nu n^{-\nu/(2\nu+1)}$,



E 



sup 



< Ci,e, Eg. 



sup 

/(/)<l,ll/l|oo<l V "■' "2'" ^ 



where $\|f\|_{2,n}^2 := n^{-1} \sum_{i=1}^n f(\xi_i)^2$ and $\|f\|_2^2 := E[f(\xi_1)^2]$. The conclusion is also true when
$\sigma_1, \dots, \sigma_n$ are independent standard normal.

Proof. The first inequality is Corollary 5 of Meier et al. (2009). Note that their $s$, $\alpha$
and $\gamma$ correspond to $s = \nu$, $\alpha = 1 - 1/(2\nu)$ and $\gamma = 2/(2\nu + 1)$ in our notation. The
second inequality can be shown in a similar way. □

Addendum A.l. It is clear that, under the same conditions of Lemma A. 2, for any 
constant K > 0, 



E„ 



sup 

few^ {[0,1]) 

U(/)<l,ll/l|co<X 



In + e'lif? 



<ae. 



The next two lemmas compare the empirical and population $L_2$-norms on the class
$\mathcal{G}$ uniformly over the distributions of $z_{1j}$ ($j \in T$).

Lemma A.4. Assume conditions (CI), (C3) and (C4). Let T be any subset of 
{!,..., d} and s := \T\. Let be the constant given in Lemma A.l. Then, there 
exists a positive constant Cq_y depending only on Cg (which is given in condition (C3)) 
and V such that, as long as 

<e< \ and C^.^e^^/^^^) max{e, Vlog(sVn)/n} < 0.5, 

with probability at least $1 - (s \vee n)^{-1}$,

$$\|g_j\|_{2,n}^2 \le 1.5 \|g_j\|_2^2 + 0.5 \epsilon^2 I(g_j)^2, \ \forall g_j \in \mathcal{G}, \ \forall j \in T.$$
Proof. In this proof, $c_{q,\nu}$ denotes some positive constant depending only on $c_q$ and
$\nu$. Its value may change from line to line. Pick any $j \in T$. For a constant $\epsilon \in (0, 1]$
specified later, define

By Lemma A.l, invoke that when I{gj) 7^ 0, 
and 

Even if $I(g_j) = 0$, these inequalities hold (with a suitable change to the constant
$c_{q,\nu}$ if necessary) since $\epsilon \le 1$. Thus, by Massart's (2000) form of Talagrand's (1996)
inequality, for all $t > 0$, with probability at least $1 - e^{-t}$,

Zj < 2E[Zj] + Cq,,^e-^/n/n + Cq^.e-^'H/n. 

Applying Lemma A.3 with the symmetrization inequality (van der Vaart and Wellner,
1996, Lemma 2.3.1) and the contraction principle (Ledoux and Talagrand, 1991,
Theorem 4.12), for all $\epsilon \ge C_\nu n^{-\nu/(2\nu+1)}$, we have $E[Z_j] \le c_{q,\nu} \epsilon^{(2\nu-1)/(2\nu)}$. Therefore, taking
$t = 2 \log(s \vee n)$, we have, with probability at least $1 - (s \vee n)^{-2}$,

Zj < Cq,, max{e(''^-^)/(''^) , e'^/^"'^) Vlog(sVn)/n, e"^/'^ log(s V n) /n}. 

By the union bound, the above inequality simultaneously holds for all $j \in T$ with
probability at least $1 - (s \vee n)^{-1}$. The desired conclusion now follows from the additional
restriction that $c_{q,\nu} \epsilon^{-1/(2\nu)} \max\{ \epsilon, \sqrt{\log(s \vee n)/n} \} \le 0.5$. □

Lemma A.5. Let $\mathcal{P}_{\nu-1}$ denote the set of all polynomial functions on $[0, 1]$ of degree
$\nu - 1$. Assume conditions (C1), (C3) and (C5). Then, with probability approaching
one, $\|h_j\|_{2,n}^2 \le 1.5 \|h_j\|_2^2$ for all $h_j \in \mathcal{P}_{\nu-1}$ and $1 \le j \le d$.

Proof. As in the previous proof, $c_{q,\nu}$ denotes some positive constant depending only on
$c_q$ and $\nu$. Its value may change from line to line. By normalization, it suffices to show
that $\max_{1 \le j \le d} \sup_{h_j \in \mathcal{H}_j} | \|h_j\|_{2,n}^2 - 1 | \to_p 0$, where $\mathcal{H}_j := \{ h_j : h_j \in \mathcal{P}_{\nu-1}, \|h_j\|_2 = 1 \}$.
Pick any $1 \le j \le d$. By condition (C3) (ii) and Lemma A.1, $\|h_j\|_\infty \le c_{q,\nu} \|h_j\|_2 = c_{q,\nu}$
for all $h_j \in \mathcal{H}_j$, and $E[h_j(z_{1j})^4] \le \|h_j\|_\infty^2 \|h_j\|_2^2 \le c_{q,\nu}$. Therefore, by Massart's form of
Talagrand's inequality, for all $t > 0$, with probability at least $1 - e^{-t}$,



where $Z_j := \sup_{h_j \in \mathcal{H}_j} | \|h_j\|_{2,n}^2 - 1 |$. We wish to evaluate $E[Z_j]$. Let $\sigma_1, \dots, \sigma_n$ be
independent Rademacher random variables independent of $z_1, \dots, z_n$. By the
symmetrization inequality and the contraction principle,



E[Zj] < C,,.E 



sup 



1 " 
^ i=i 



Arguing as in Meier et al. (2009, p. 3813) (or using a standard entropy integral argu- 
ment), it is shown that 



E 



sup 



1 " 

-^aihj{zij) 



< 



Taking $t = 2 \log d$, we have, with probability at least $1 - d^{-2}$,



\ogd 



n 



(Recall that $\log d / n \to 0$.)

By the union bound, the above inequality simultaneously holds for all $1 \le j \le d$
with probability at least $1 - d^{-1}$. Recalling that $d \to \infty$, we obtain the desired
conclusion. □



A.2 Proof of Lemma 6.1

Parts (i) and (ii): We first point out that (ii) follows from (i). Suppose that (i) is
true for the case that $u_i$ are independent Rademacher random variables independent of
$z_1, \dots, z_n$. Let $\sigma_1, \dots, \sigma_n$ denote independent Rademacher random variables
independent of $z_1, \dots, z_n$. By the symmetrization inequality for probabilities (van der Vaart
and Wellner, 1996, Lemma 2.3.7), for all $t \ge \sqrt{8/n}$,



Thus, by (i), the right side goes to zero with $t = 4 \max\{ \epsilon, C_1 \sqrt{\log(s \vee n)/n} \}$.
In what follows, we wish to show (i). Take $C_\nu$ as in Lemma A.3 and let $\epsilon = \epsilon_n \to 0$
be any sequence such that $\epsilon \ge C_\nu n^{-\nu/(2\nu+1)}$. Define the events

$$\Omega_7 := \{ \|g_j\|_{2,n}^2 \le 1.5 \|g_j\|_2^2 + 0.5 \epsilon^2 I(g_j)^2, \ \forall g_j \in \mathcal{G}, \ \forall j \in T \},$$
$$\Omega_8 := \{ \|h_j\|_{2,n}^2 \le 1.5 \|h_j\|_2^2, \ \forall h_j \in \mathcal{P}_{\nu-1}, \ \forall j \in T \},$$

where $\mathcal{P}_{\nu-1}$ denotes the set of all polynomial functions on $[0, 1]$ of degree $\nu - 1$.

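We first consider case (a) in condition (C2). By normalization, it suffices to consider
the case that $|u_1| \le 1$ almost surely. Pick any $j \in T$. Consider the function

$$F_j(u, z_{\cdot j}) := \sup_{g_j \in \mathcal{G}} \frac{| n^{-1} \sum_{i=1}^n u_i g_j(z_{ij}) |}{\sqrt{\|g_j\|_2^2 + \epsilon^2 I(g_j)^2}} = \sup_{g_j \in \mathcal{G}} \frac{n^{-1} \sum_{i=1}^n u_i g_j(z_{ij})}{\sqrt{\|g_j\|_2^2 + \epsilon^2 I(g_j)^2}}, \quad u = (u_1, \dots, u_n)', \ z_{\cdot j} = (z_{1j}, \dots, z_{nj})',$$

where the second equality is due to the fact that $\mathcal{G}$ is symmetric, i.e., if $g_j \in \mathcal{G}$
then $-g_j \in \mathcal{G}$. Given $z_1, \dots, z_n$, the map $u \mapsto F_j(u, z_{\cdot j})$ is Lipschitz continuous with
Lipschitz constant bounded by

$$n^{-1/2} \sup_{g_j \in \mathcal{G}} \frac{\|g_j\|_{2,n}}{\sqrt{\|g_j\|_2^2 + \epsilon^2 I(g_j)^2}},$$

which is, on the event $\Omega_7$, bounded by $C n^{-1/2}$. Therefore, by Corollary 4.8 of Ledoux
(2001), on the event $\Omega_7$,

$$P_u\{ F_j(u, z_{\cdot j}) \ge E_u[F_j(u, z_{\cdot j})] + C t n^{-1/2} \} \le c_1 e^{-c_2 t^2},$$

where $P_u$ ($E_u$) denotes the conditional probability (expectation, respectively) with
respect to $u_1, \dots, u_n$ given $z_1, \dots, z_n$, and $c_1 > 0$ and $c_2 > 0$ are universal constants.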


By Lemma A. 2, invoke that gj G Q can be written as gj — gf^ + g^P such that (i) 



,[2] 



5'f'lloo < const. y.I{(jP) (the constant is independent of g'j); (ii) g^ G Vv-\i and (ni) 



[2] 



E[5'?'(-Zij)S'i'^^(-Zij)] = 0. Observe that on the event fl Jig, 



J2] 



+ 



II. [2] I 



[2], 



■"1-3) 



< c 




1 V^" 

Ei=i Uig) 



9f\\ln + eH{gfy 



+ 



I -1 v^n [2]/ 



J2]| 



2,n 



> . 



By Lemma A.3 (see also Addendum A.1) and the fact that the $u_i$'s can be replaced by
independent Rademacher variables by the contraction principle, the conditional
expectation (given $z_1, \dots, z_n$) of the first term inside the brace is bounded by $C_\nu \epsilon$. On
the other hand, by Meier et al. (2009, p. 3813), the conditional expectation of the
second term inside the brace is bounded by a constant times $n^{-1/2}$ where the constant
depends only on $\nu$. So there exists a constant $C_\nu'$ depending only on $\nu$ such that
$E_u[F_j(u, z_{\cdot j})] \le C_\nu' \epsilon$ on the event $\Omega_7 \cap \Omega_8$. Therefore, on the event $\Omega_7 \cap \Omega_8$,

$$P_u\{ F_j(u, z_{\cdot j}) \ge C_\nu' \epsilon + C \sqrt{\log(s \vee n)/n} \} \le c_1 (s \vee n)^{-2},$$

where we have taken $t = \sqrt{2 c_2^{-1} \log(s \vee n)}$.
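We now let $j$ vary. By the union bound,

$$P\Big\{ \max_{j \in T} F_j(u, z_{\cdot j}) \ge C_\nu' \epsilon + C \sqrt{\log(s \vee n)/n} \Big\}$$
$$\le P\Big[ \Big\{ \max_{j \in T} F_j(u, z_{\cdot j}) \ge C_\nu' \epsilon + C \sqrt{\log(s \vee n)/n} \Big\} \cap \Omega_7 \cap \Omega_8 \Big] + P(\Omega_7^c) + P(\Omega_8^c)$$
$$\le c_1 (s \vee n)^{-1} + P(\Omega_7^c) + P(\Omega_8^c).$$

By Lemmas A.4 and A.5, $P(\Omega_7^c) + P(\Omega_8^c) \to 0$. Therefore, we obtain (i) for the case
that $|u_1| \le 1$ almost surely.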

We next consider case (b) in condition (C2). Recall that $u_i | z_i \sim N(0, \sigma_u(z_i)^2)$
and $\sigma_u(z_i) \le \sigma_u$ almost surely. Put $\tilde u_i := u_i / \sigma_u(z_i)$. Then, $\tilde u_1, \dots, \tilde u_n$ are independent
standard normal random variables independent of $z_1, \dots, z_n$. Consider now the
function
\n''^Jl=i^iCru{Zi)gj{zij) 



Fj{u, Zi) := sup 



9j 



^0 VhM + ^'Hdj)' 



-, W = . . . , Un)', Zi = {^1, . . . , Zn}. 



Given zi, . . . , Zn, the map u ^ Fj{u, z^) is Lipschitz continuous with Lipschitz con- 
stant bounded by 



cr„ sup 



^9,eg ^/\\gj\\l + e'^Iigj)^' 
which is, on the event Q^, bounded by Caun~^^^. Therefore, by Theorem 7.1 of Ledoux 
(2001), on the event f)?, 

Pu{F,{u, Z-) > E^F^iu, z^)] + Cajn-'/^} < e-'"l\ 

By the contraction principle for Gaussian processes (Ledoux and Talagrand, 1991,
Corollary 3.17),



o Li U . 

9i^Q Vllfjlli + e^^(fj)^ 



The rest of the procedure is the same as the previous one. 
Part (iii): The result basically follows from the same argument as the proof of
Meier et al. (2009, Theorem 6). For the sake of completeness, we provide an outline 
of the proof. In what follows, $C_{q,\nu}$ denotes some constant depending only on $C_q$ and $\nu$.
Its value may change from line to line.

Let $\sigma_1, \dots, \sigma_n$ denote independent Rademacher random variables independent of
$z_1, \dots, z_n$. Define

^ I llfflli.n ~ llgllil 

Z sup \^-' U^i-^9i^^)\ _ 

By Lemma A.l, for = X)?=i Qj^ 9j ^S, 

d d 

and 



.j=i 



where we have use the inequality that ||5f||2 < Ylfj=i Ikilh < X]j=i Vllfi'illi + ^"^^{djY- 
Thus, by Massart's (2000) form of Talagrand's (1996) inequality, for all $t > 0$, with
probability at least $1 - e^{-t}$,



Z < 2E[Z] + Cq^.^t-^n/n + Cg,,t~^/H/n. 

We wish to evaluate E[Z]. Using the symmetrization inequality and the contraction 
principle, we have 

E[Z] < C,,,e-^/(2'')E[Z]. 

By a standard calculation, 

E[Z] < E 



max sup 



n J2'^=icrigj{z, 



13) 



Recall that $\delta = \max\{n^{-\nu/(2\nu+1)}, \sqrt{\log d/n}\}$. By Lemma 13 of Meier et al. (2009) and
Lemma A.3, we have



E 



1^ ^T.7=i'^i9ji 



max sup , 



< 4 max E 

i<j<d 



sup 



+ 



2C,,,e-V(2-)(i + logd) /4(l + logd) 



3n 



+ 



< Cq^y max < e, 



n \ n 



< Cq^i, max{e, 5}. 

The last inequality is because
e-V(2'^)logci ^ 

Thus, with probability at least $1 - e^{-t}$,



logd 



X 



n 



Z < C,,,max{e(2-i)/(2-),e-V(2v)5^ ^J^Vn/^.e-^'^t/n). 

Letting $t \to \infty$ sufficiently slowly (such that $\sqrt{t/n} \le \epsilon$), we have, with probability
approaching one, $Z \le C_{q,\nu}\epsilon^{(2\nu-1)/(2\nu)} \max\{\epsilon, \delta\}$, which implies the desired conclusion.

□ 



B Proofs for Section 4 

B.1 Proofs of Propositions 4.1 and 4.2

We first point out that since Yl^=i = 0, by a standard argument, we may assume 
that $c^* = E[y_i] = 0$ for the analysis of $g$.

Proof of Proposition 4.1. The proof is a direct adaptation of that of Bickel et al. (2009,
Theorem 6.1), so we omit the details here.

□ 

For a subset $T \subset \{1, \dots, d\}$, let $x_{iG_T}$ denote the $m|T| \times 1$ vector stacked by $x_{iG_j}$, $j \in T$.

Proof of Proposition 4.2. Recall that := {j e {1, . . . , d} : ||^j||2,n > 0}. Note that 
on the event Qq, = {j e {1, . . . , d} : ||/3gJU 0}- By the Karush-Kuhn-Tucker 
condition, on the event 
which implies that on the event $\{\lambda_1 \ge 2\Lambda\} \cap \Omega_0$,



mAi 



Vs < 



n 



< 



< 



< 



1 " 

- J]*iG^o(2/i-*i/3) 

i=l 
1 " 

•i=l 

^ n 1 

1.5 



(ri := g*{zi) - g{zi)) 



i=l 



1=1 



2n 



1 

n 

1=1 



Applying the Cauchy-Schwarz inequality to the last line, we obtain



mXi 



VI <3 



n 



1 

n ^ 

«=i 



< 3||g'* - ^||2,n0max(s). 



Thus, on the event {Ai > 2A V 0} fl Qq, 

We wish to show that $\phi_{\max}(\hat s)^2 \le \min_{s \in S} \phi_{\max}(s)^2$. Because the map $s \mapsto \phi_{\max}(s)^2$ is
non-decreasing, it suffices to show that $\hat s \le s$ for any $s \in S$. Pick any $s \in S$. Suppose
on the contrary that $\hat s > s$. Then,

S < C^^^sfs* = C^^^is ■ is/s)ys* < C^\s/s']4>ra..isfs* < 2C\s / s) ^^.^isf s* < s, 

a contradiction (we have used the property $\phi_{\max}(ls)^2 \le \lceil l \rceil \phi_{\max}(s)^2$ for $l > 0$, which can
be shown in a similar way as the proof of Belloni and Chernozhukov (2011b, Lemma
8)). Therefore, we obtain the desired conclusion. □

B.2 Proof of Theorem 4.2 

Notation: In condition (C5), let $K$ denote a fixed constant such that

$$\sup_{z \in [0,1]} \|(\psi_1(z), \dots, \psi_m(z))'\|_E \le K m^{1/2}.$$

Theorem 4.2 follows from Propositions 4.1 and 4.2 together with Lemmas B.1-B.5 
below. In what follows, we always assume conditions (C1)-(C5) and (C7). 



Lemma B.1. Assume that $m \log d / n \to 0$. Then, $P(\Omega_0) \to 1$.
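Proof of Lemma B.1. Without loss of generality, we may assume that $E[x_i] = 0$. It suffices to show that $\max_{1 \le j \le d} \|\hat\Sigma_j - I_m\| \stackrel{P}{\to} 0$. Because $\hat\Sigma_j = \hat\Sigma_{0j} - \bar x_{G_j}\bar x_{G_j}'$ ($\hat\Sigma_{0j} := n^{-1}\sum_{i=1}^n x_{iG_j}x_{iG_j}'$), it suffices to show that

$$\max_{1 \le j \le d} \|\bar x_{G_j}\|_E \stackrel{P}{\to} 0, \qquad \max_{1 \le j \le d} \|\hat\Sigma_{0j} - I_m\| \stackrel{P}{\to} 0.$$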



The second assertion follows from Lemma 3.2 of Kato (2011). We wish to show the first assertion. Pick any $j \in \{1, \dots, d\}$. By Corollary 4.5 of Ledoux (2001), for all $t > 0$, with probability at least $1 - e^{-t^2/2}$, we have

$$\|\bar x_{G_j}\|_E \le (1 + tK)\sqrt{m/n},$$

where we have used the fact that $E[\|\bar x_{G_j}\|_E] \le \sqrt{n^{-2}\sum_{i=1}^n E[\|x_{iG_j}\|_E^2]} = \sqrt{m/n}$ and $|\alpha' x_{iG_j}| \le K\sqrt{m}$ (since we are considering the one-sided deviation inequality, the factor 2 in front of the exponential term in Corollary 4.5 of Ledoux (2001) can be replaced by 1). By the union bound, the above inequality simultaneously holds for all $1 \le j \le d$ with probability at least $1 - d e^{-t^2/2}$. Taking $t = 2\sqrt{\log d}$, we obtain the desired result. □
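The arithmetic behind the union-bound step can be checked numerically. The following is a minimal sketch, assuming the per-coordinate tail bound is $e^{-t^2/2}$ as in the one-sided deviation inequality quoted in the proof; the function names are illustrative and do not come from the paper. It confirms that the choice $t = 2\sqrt{\log d}$ makes the total failure bound $d e^{-t^2/2}$ equal to $1/d$, which vanishes as $d \to \infty$.

```python
import math


def union_bound_failure(d, t):
    """Union bound d * exp(-t^2 / 2) over d one-sided deviation events."""
    return d * math.exp(-t * t / 2.0)


def failure_at_chosen_t(d):
    """Failure bound at the choice t = 2 * sqrt(log d) (assumed from the proof)."""
    t = 2.0 * math.sqrt(math.log(d))
    return union_bound_failure(d, t)


if __name__ == "__main__":
    for d in (10, 1_000, 100_000):
        # The two printed numbers agree up to floating-point error.
        print(d, failure_at_chosen_t(d), 1.0 / d)
```

With this choice of $t$ the deviation term is of order $K\sqrt{m \log d / n}$, which is why the lemma assumes $m \log d / n \to 0$.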

Lemma B.2. There exists a positive constant Ai^u depending only on the distribution 
of ui such that for any Ai satisfying 



m 



we have $P\{\lambda_1 \ge 2\Lambda\} \to 1$.

Proof of Lemma B.2. This follows from deviation inequalities in product spaces. We
first consider case (a) in condition (C2). By normalization, it suffices to consider the
case that $|u_1| \le 1$ almost surely. Pick any $1 \le j \le d$. Consider the function



$$F_j(u, z_{\cdot j}) := \Big\| \sum_{i=1}^n u_i x_{iG_j} / \sqrt{m} \Big\|_E.$$

(Recall that $x_{iG_j}$ is generated by $z_{ij}$.) Given $z_1, \dots, z_n$, the map $u \mapsto F_j(u, z_{\cdot j})$ is Lipschitz continuous with Lipschitz constant bounded by $\sqrt{n/m}$ (invoke that the maximum eigenvalue of $n^{-1}\sum_{i=1}^n x_{iG_j}x_{iG_j}'$ is 1). Thus, by Corollary 4.8 of Ledoux (2001), for all $t > 0$,

$$P_u\{F_j(u, z_{\cdot j}) > E_u[F_j(u, z_{\cdot j})] + Ct\sqrt{n/m}\} \le c_1 e^{-c_2 t^2},$$

where $P_u$ ($E_u$) denotes the conditional probability (expectation, respectively) of $u_1, \dots, u_n$ given $z_1, \dots, z_n$, and $c_1 > 0$ and $c_2 > 0$ are universal constants. A direct calculation shows that $E_u[F_j(u, z_{\cdot j})] \le \sqrt{n}$ (invoke that the trace of the matrix $n^{-1}\sum_{i=1}^n x_{iG_j}x_{iG_j}'$ is bounded by $m$). Therefore, taking $t = \sqrt{2c_2^{-1}\log d}$, we have, with probability at least $1 - c_1 d^{-2}$,

$$F_j(u, z_{\cdot j}) \le \sqrt{n} + C\sqrt{n \log d / m}.$$

By the union bound, the above inequality simultaneously holds for all $1 \le j \le d$ with probability at least $1 - c_1 d^{-1}$. Recalling that $d \to \infty$, we obtain the desired conclusion for the case that $|u_1| \le 1$ almost surely.

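In case (b) in condition (C2), we use Theorem 7.1 of Ledoux (2001) instead of its Corollary 4.8. Recall that $u_i \mid z_i \sim N(0, \sigma_u(z_i)^2)$ and $\sigma_u(z_i) \le \sigma_u$ almost surely. Put $\tilde u_i := u_i/\sigma_u(z_i)$. Then, $\tilde u_1, \dots, \tilde u_n$ are independent standard normal random variables independent of $z_1, \dots, z_n$. Define

$$F_j(\tilde u, z_{\cdot j}) := \Big\| \sum_{i=1}^n \tilde u_i \sigma_u(z_i) x_{iG_j} / \sqrt{m} \Big\|_E.$$

Given $z_1, \dots, z_n$, the map $\tilde u \mapsto F_j(\tilde u, z_{\cdot j})$ is Lipschitz continuous with Lipschitz constant bounded by $\sigma_u\sqrt{n/m}$. Thus, by Theorem 7.1 of Ledoux (2001), for all $t > 0$,

$$P_{\tilde u}\{F_j(\tilde u, z_{\cdot j}) > E_{\tilde u}[F_j(\tilde u, z_{\cdot j})] + \sigma_u t\sqrt{n/m}\} \le e^{-t^2/2},$$

where, by a direct calculation, $E_{\tilde u}[F_j(\tilde u, z_{\cdot j})] \le \sigma_u\sqrt{n}$. The rest of the procedure is exactly the same as the previous one. □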

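Lemma B.3. Let $\hat\phi_{\max}(s)$ denote the $s$-th group sparse maximum eigenvalue of $\hat\Sigma^{1/2}$: $\hat\phi_{\max}(s) := \sup_{|T| \le s} \sup_{\alpha \in \mathbb{S}^{dm-1},\, \alpha_{G_{T^c}} = 0} \|\hat\Sigma^{1/2}\alpha\|_E$. Assume that $s$, $m$, $d$ and $n$ obey the growth condition

$$\frac{s^2 m \log(d \vee n)}{n} \to 0. \tag{B.1}$$

Then, $\hat\phi_{\max}(s) \lesssim_P 1$ provided that $\phi_{\max}(s) \lesssim 1$.

Recall that for a subset $T \subset \{1, \dots, d\}$, $x_{iG_T}$ denotes the $m|T| \times 1$ vector stacked by $x_{iG_j}$, $j \in T$. We use the notation: $\hat\Sigma_T := n^{-1}\sum_{i=1}^n (x_{iG_T} - \bar x_{G_T})(x_{iG_T} - \bar x_{G_T})'$, $\hat\Sigma_{0T} := n^{-1}\sum_{i=1}^n x_{iG_T}x_{iG_T}'$ and $\Sigma_T := E[x_{iG_T}x_{iG_T}']$.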

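Proof of Lemma B.3. Without loss of generality, we may assume that $E[x_i] = 0$. Pick any subset $T$ of $\{1, \dots, d\}$ such that $|T| = s$. We wish to evaluate a tail probability of $\|\hat\Sigma_T - \Sigma_T\|$. Since $\|\hat\Sigma_T - \Sigma_T\| \le \|\hat\Sigma_{0T} - \Sigma_T\| + \|\bar x_{G_T}\|_E^2$, we separately evaluate the two terms on the right.

We first evaluate the term $\|\hat\Sigma_{0T} - \Sigma_T\|$. Invoke the expression

$$\|\hat\Sigma_{0T} - \Sigma_T\| = \sup_{\alpha \in \mathbb{S}^{sm-1}} \Big| \frac{1}{n}\sum_{i=1}^n \{(\alpha' x_{iG_T})^2 - E[(\alpha' x_{iG_T})^2]\} \Big|.$$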
Applying Massart's (2000) form of Talagrand's (1996) inequality to the right sum, for all $t > 0$, with probability at least $1 - e^{-t}$, we have

$$\|\hat\Sigma_{0T} - \Sigma_T\| \le 2E[\|\hat\Sigma_{0T} - \Sigma_T\|] + CK\phi_{\max}(s)\sqrt{smt/n} + CK^2 smt/n,$$

where we have used the fact that $|\alpha' x_{iG_T}| \le K\sqrt{sm}$ and $E[(\alpha' x_{iG_T})^4] \le K^2 sm E[(\alpha' x_{iG_T})^2] \le K^2 sm \phi_{\max}(s)^2$. We now bound the expectation $E[\|\hat\Sigma_{0T} - \Sigma_T\|]$ by Rudelson's (1999) inequality:

$$E[\|\hat\Sigma_{0T} - \Sigma_T\|] \le CK\phi_{\max}(s)\sqrt{\frac{sm\log(sm)}{n}} + CK^2\frac{sm\log(sm)}{n}.$$

Thus, for all $t > 0$, with probability at least $1 - e^{-t}$, we have

$$\|\hat\Sigma_{0T} - \Sigma_T\| \le CK\phi_{\max}(s)\sqrt{\frac{sm(t \vee \log(sm))}{n}} + CK^2\frac{sm(t \vee \log(sm))}{n}.$$

Because the number of all subsets $T$ of $\{1, \dots, d\}$ such that $|T| = s$ is $\binom{d}{s} \le (ed/s)^s$, the above inequality simultaneously holds for all such $T$ with probability at least $1 - \exp\{s\log(ed/s) - t\}$. Taking $t = 2s\log(ed/s)$ and recalling condition (B.1), we have

$$\max_{|T| \le s} \|\hat\Sigma_{0T} - \Sigma_T\| = \max_{|T| = s} \|\hat\Sigma_{0T} - \Sigma_T\| \lesssim_P o(1)(\phi_{\max}(s) \vee 1) = o(1), \tag{B.2}$$

provided that $\phi_{\max}(s) \lesssim 1$.
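The counting step in this argument rests on the standard bound $\binom{d}{s} \le (ed/s)^s$ and on the choice $t = 2s\log(ed/s)$, which turns the union-bound exponent $s\log(ed/s) - t$ into $-s\log(ed/s)$. The snippet below is a small illustrative check of both facts; the helper names are hypothetical and not part of the paper.

```python
import math


def subset_count_and_bound(d, s):
    """Return (C(d, s), (e * d / s) ** s) for integers 1 <= s <= d."""
    return math.comb(d, s), (math.e * d / s) ** s


def union_failure_exponent(d, s):
    """Exponent s*log(e*d/s) - t evaluated at t = 2*s*log(e*d/s)."""
    base = s * math.log(math.e * d / s)
    t = 2.0 * base
    return base - t


if __name__ == "__main__":
    for d, s in ((10, 3), (100, 5), (1000, 20)):
        count, bound = subset_count_and_bound(d, s)
        # The exact count never exceeds the (e d / s)^s bound.
        print(d, s, count <= bound, union_failure_exponent(d, s))
```

Since the exponent is negative and grows in magnitude with $s\log(ed/s)$, the probability that the inequality fails for some subset $T$ of size $s$ tends to zero.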

It remains to bound $\|\bar x_{G_T}\|_E$. By Corollary 4.5 of Ledoux (2001), for all $t > 0$, with probability at least $1 - e^{-t^2/2}$, we have

$$\|\bar x_{G_T}\|_E \le (1 + CKt)\sqrt{sm/n},$$

where we have used the fact that $E[\|\bar x_{G_T}\|_E] \le \sqrt{n^{-2}\sum_{i=1}^n E[\|x_{iG_T}\|_E^2]} = \sqrt{sm/n}$ and $|\alpha' x_{iG_T}| \le K\sqrt{sm}$. Taking $t = 2\sqrt{s\log(ed/s)}$, we have

$$\max_{|T| \le s} \|\bar x_{G_T}\|_E = \max_{|T| = s} \|\bar x_{G_T}\|_E = o_P(1). \tag{B.3}$$

Combining (B.2) and (B.3), we have

$$\hat\phi_{\max}(s)^2 \le \phi_{\max}(s)^2 + \max_{|T| \le s} \|\hat\Sigma_T - \Sigma_T\| = \phi_{\max}(s)^2 + o_P(1) \lesssim_P 1.$$

□

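Lemma B.4. Let $\hat\kappa$ denote the $C$-restricted eigenvalue of $\hat\Sigma^{1/2}$: $\hat\kappa := \min_{\alpha \in \mathbb{S}^{dm-1} \cap C} \|\hat\Sigma^{1/2}\alpha\|_E$. Assume that $s^*$, $m$, $d$ and $n$ obey the growth condition

$$\frac{(s^*)^2 m \log(d \vee n)}{n} \to 0. \tag{B.4}$$

Then, we have $\hat\kappa \gtrsim_P 1$ provided that $\phi_{\max}(2s^*) \lesssim 1$ and $\kappa \gtrsim 1$.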
Proof of Lemma B.4. We partly use an idea of Zhou (2009), but the overall proof is
quite different. Put $c_0 = 21$. For any $\alpha \in C$, we decompose $\alpha$ into a set of subvectors $\alpha_{G_{T_0}}, \alpha_{G_{T_1}}, \dots, \alpha_{G_{T_L}}$ in such a way that $T_0$ corresponds to the $s^*$ largest groups in $\alpha$ in the Euclidean norm, $T_1$ corresponds to the $s^*$ largest groups in $\alpha_{G_{T_0^c}}$ and so on. Then, we have $T_0^c = \cup_{l=1}^L T_l$ where $|T_l| = s^*$ for $1 \le l \le L - 1$ and $|T_L| \le s^*$. Because for each $l \ge 1$,



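$$\|\alpha_{G_{T_l}}\|_E \le \sqrt{s^*} \max_{j \in T_l} \|\alpha_{G_j}\|_E \le \frac{1}{\sqrt{s^*}} \sum_{j \in T_{l-1}} \|\alpha_{G_j}\|_E,$$

we have

$$\sum_{l=1}^{L} \|\alpha_{G_{T_l}}\|_E \le \frac{1}{\sqrt{s^*}} \sum_{l=0}^{L-1} \sum_{j \in T_l} \|\alpha_{G_j}\|_E \le \frac{1 + c_0}{\sqrt{s^*}} \sum_{j \in T^*} \|\alpha_{G_j}\|_E \le (1 + c_0)\|\alpha_{G_{T^*}}\|_E \le (1 + c_0)\|\alpha_{G_{T_0}}\|_E,$$

where we have used the fact that $\sum_{j \in (T^*)^c} \|\alpha_{G_j}\|_E \le c_0 \sum_{j \in T^*} \|\alpha_{G_j}\|_E$ and $\|\alpha_{G_{T^*}}\|_E \le \|\alpha_{G_{T_0}}\|_E$ by construction. Therefore, we have

$$\sum_{l=0}^{L} \|\alpha_{G_{T_l}}\|_E \le (2 + c_0)\|\alpha_{G_{T_0}}\|_E. \tag{B.5}$$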
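The block decomposition used here is the familiar "shelling" device: sort the groups by Euclidean norm, cut them into consecutive blocks of size $s^*$, and bound each block by the $\ell_1$-type mass of the preceding block. The sketch below illustrates that single step on a list of group norms; it works only with the norms $\|\alpha_{G_j}\|_E$ rather than the full vector, and all function names are illustrative assumptions, not the paper's notation.

```python
import math
import random


def shell_blocks(group_norms, s_star):
    """Sort group norms in decreasing order and cut them into blocks of size s_star.

    Block 0 plays the role of T_0 (the s_star largest groups), block 1 of T_1, etc.
    Only the last block may contain fewer than s_star groups.
    """
    ordered = sorted(group_norms, reverse=True)
    return [ordered[k:k + s_star] for k in range(0, len(ordered), s_star)]


def block_norm(block):
    """Euclidean norm of the stacked block, computed from its group norms."""
    return math.sqrt(sum(v * v for v in block))


def check_shelling(group_norms, s_star):
    """Check ||alpha_{T_l}|| <= s_star**-0.5 * sum_{j in T_{l-1}} ||alpha_{G_j}|| for l >= 1."""
    blocks = shell_blocks(group_norms, s_star)
    return all(
        block_norm(blocks[l]) <= sum(blocks[l - 1]) / math.sqrt(s_star) + 1e-12
        for l in range(1, len(blocks))
    )


if __name__ == "__main__":
    random.seed(0)
    norms = [random.random() for _ in range(37)]
    print(check_shelling(norms, 5))
```

The inequality holds for any input because every group in block $l$ is no larger than the smallest group in block $l-1$, hence no larger than that block's average.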
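In what follows, we identify $\alpha_{G_{T_l}}$ with the $dm \times 1$ vector whose $G_{T_l}$-block equals $\alpha_{G_{T_l}}$ and whose other elements are all zero. Under this identification, $\alpha$ can be written as $\alpha = \sum_{l=0}^L \alpha_{G_{T_l}}$. Invoke now that

$$|\alpha'(\hat\Sigma - \Sigma)\alpha| \le \sum_{l=0}^{L}\sum_{l'=0}^{L} |\alpha_{G_{T_l}}'(\hat\Sigma - \Sigma)\alpha_{G_{T_{l'}}}| = \sum_{l=0}^{L}\sum_{l'=0}^{L} \|\alpha_{G_{T_l}}\|_E \|\alpha_{G_{T_{l'}}}\|_E |h_l'(\hat\Sigma - \Sigma)h_{l'}| \quad (h_l := \alpha_{G_{T_l}}/\|\alpha_{G_{T_l}}\|_E)$$

$$\le (2 + c_0)^2 \|\alpha_{G_{T_0}}\|_E^2 \max_{0 \le l, l' \le L} |h_l'(\hat\Sigma - \Sigma)h_{l'}|, \tag{B.6}$$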
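where we have used (B.5). Take and fix any $l, l'$ such that $0 \le l, l' \le L$, and let $T' = T_l \cup T_{l'}$. It is not hard to see that

$$|h_l'(\hat\Sigma - \Sigma)h_{l'}| \le \|\hat\Sigma_{T'} - \Sigma_{T'}\|.$$

Since $|T'| \le 2s^*$, we see that

$$\max_{0 \le l, l' \le L} |h_l'(\hat\Sigma - \Sigma)h_{l'}| \le \max_{|T| \le 2s^*} \|\hat\Sigma_T - \Sigma_T\|. \tag{B.7}$$

Combining (B.6) and (B.7), we have

$$|\hat\kappa^2 - \kappa^2| \le \max_{\alpha \in \mathbb{S}^{dm-1} \cap C} |\alpha'(\hat\Sigma - \Sigma)\alpha| \le (2 + c_0)^2 \max_{|T| \le 2s^*} \|\hat\Sigma_T - \Sigma_T\|.$$

By the proof of Lemma B.3, if (B.4) holds and $\phi_{\max}(2s^*) \lesssim 1$, we have $\max_{|T| \le 2s^*} \|\hat\Sigma_T - \Sigma_T\| = o_P(1)$. Therefore, we have $\hat\kappa^2 \ge \kappa^2 - o_P(1) \gtrsim_P 1$ provided that $\kappa \gtrsim 1$.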

□ 

Lemma B.5. // \\g*\\l < s*, then inf^^gT* ||^* - g\\l^ <p s* max{m-^'',n-^}. 

Proof of Lemma B.5. Let g"^ denote an element of such that US'*— 5'"^||2 = ^^^geg^* Wq* 
g\\2. g"" can be written as ^'"(z) = T^j^t* dTi^j) = (^^i, • • • , ^^d)') and gj'i-) = 
c* + YJk=i(^'jk^k{-) for some c* e (3*^ e ^ (I < k < m) for each j e T* . Be- 
cause E[5(*(2i)] = 0, we have E[5(™(zi)] = 0, so that each c* may be taken such that 

gn-) = EZimM-)-nM^i,)])- Take ^(z) := E,eT*Ek=i^UMzj)-^jk). 
Invoke now that 

lb* - rwin < ng* - gniln + ng"^ - rwin 

-ng* - gniln + Hn-'Elig'^'i^i)?- 

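By condition (C7)-(c) and Markov's inequality, we have

$$\|g^* - g^m\|_{2,n}^2 \lesssim_P s^* m^{-2\nu},$$

while by the fact that $E[g^m(z_1)] = 0$, we have

$$\Big(n^{-1}\sum_{i=1}^n g^m(z_i)\Big)^2 \lesssim_P n^{-1}\|g^m\|_2^2 \lesssim n^{-1}(\|g^*\|_2^2 + \|g^* - g^m\|_2^2) \lesssim s^* n^{-1}.$$

Therefore, we have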
inf , 11/ - g\\ln < \\g* - g^lln <p s* max{m-^^ n-'}. 
seer 

□ 
B.3 Proof of Corollary 4.1

Take g"^ — "^j^rp* gj" as in condition (C7)-(c)'. Without loss of generality, we may 
assume that ElgY'izij)] — for all j e T*, so that each ^rj* is written as = 
Ek=il3%iM-) - ^[M^ij)]) for some G R (1 < A; < m). Define (3* e R''^ by 
p*^ = for j e T* and 1 < A; < m, and /3^. = for j e (r*)^ Let g"^ := 

Eti^f' ^r(-) T^Zil^UM-) - ^jk)- By the proof of Lemma B.5, we have 
_ ^"*|||^ <p 8*171'^". Invoke that 



m||2 < s*rnXi 

2," ~P ^2 ' 



11/3 - /3*||| < 0„.in(T* U f)-'\\g - rWln <P \\9 - 9%n + \\9* - ~9 
so that 

\\'^jeT*\fo9j^\\2 — ^inax{s ) 1 1 /^G^ . ^^^o 1 1 £ ~P ' 

Therefore, we have 

22 2 5*7?7'A^ 

IEj6T*\f5'j II2 ^ 2||5^^gy*,^j,o5'J^||2 + 2|Ejgy*\f.o(5'j — 5'7*)||2 ~^ ■ 



□ 



C On condition (C2) 

Suppose that $z_1, \dots, z_n$ are given and fixed. As argued in Section 2.1, the key property of the error distribution to our rate analysis of Theorems 4.1 and 4.2 is the normal concentration property (around its mean) of a random variable of the form $\sup_{t \in T} \sum_{i=1}^n u_i t_i$ where $T$ is a bounded and countable subset of $\mathbb{R}^n$, which means that, letting $Z := \sup_{t \in T} \sum_{i=1}^n u_i t_i$,

$$P(Z \ge E[Z] + \sigma r) \le C\exp(-cr^2), \quad \forall r > 0, \tag{C.1}$$

where $\sigma := \sup_{t \in T} \sqrt{\sum_{i=1}^n t_i^2}$, and $c > 0$ and $C > 0$ are fixed constants (see the proofs of Lemmas 6.1 and B.2). Condition (C2) gives a primitive sufficient condition for this normal concentration property. See Ledoux (2001) for an excellent exposition of the concentration of measure phenomenon. On the other hand, Meier et al. (2009) assumed a uniform subgaussian condition

$$E[\exp(u_i^2/L) \mid z_i] \le M, \quad \text{a.s.}, \tag{C.2}$$

for some fixed constants $L > 0$ and $M > 0$, which is weaker than our condition (C2), but their established rate $s^*(\log d/n)^{2\nu/(2\nu+1)}$ is suboptimal. Suzuki et al. (2011) later showed that the Meier et al. (2009) estimator achieves the minimax rate $s^*\delta^2$ but assumed that the error term is uniformly bounded.¹ It is thus of some interest how our rate analysis changes if our condition (C2) is replaced by the weaker condition (C.2).

To the best of the author's knowledge, it is not known whether the uniform subgaussian condition alone ensures the normal concentration property (C.1). However, by Theorem C.1 ahead, under the uniform subgaussian condition, a slightly weaker inequality

$$P(Z \ge E[Z] + \sigma r) \le C\exp(-cr^2/\log n), \quad \forall r \ge 4, \tag{C.3}$$

holds. A careful inspection of the proofs shows that if condition (C2) is replaced by (C.2), under some modifications to the conditions, the rate of convergence of the second step estimator will be

max [s*n~''^/^''^^'\ \f\T*\~d^ WEjeT^xfgnl] ' 
in the canonical case, where 

^:=max|n-/(^-+^),y^ ^^"g"^j^"g^^ 

and rate of convergence of the group Lasso estimator will be So the second step 

estimator at least achieves the rate s*6'^. The only difference is the appearance of the 
additional logn term, and as long as \ogd/ (nlogn) — > 0, s*5^ is faster than the rate 
s*{logd/n)'^'^^^'^'^~^^\ It is also expected that, under the uniform subgaussian condition 
(C.2), the Meier et al. (2009) estimator has the same rate of convergence as s*P. 

C.l Proof of (C.3) 

Recall the $\psi_\alpha$-norm:

$$\|X\|_{\psi_\alpha} = \inf\{s > 0 : E[\exp(|X|^\alpha/s^\alpha)] \le 2\}, \quad \alpha > 0.$$
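As a concrete instance of this definition, the $\psi_2$-norm of a standard normal variable can be computed in closed form: for $X \sim N(0,1)$ and $s^2 > 2$, $E[\exp(X^2/s^2)] = (1 - 2/s^2)^{-1/2}$, and setting this equal to 2 gives $s = \sqrt{8/3}$. The following sketch, with illustrative function names that are not from the paper, recovers the same value by bisection on the defining condition.

```python
import math


def gaussian_mgf_of_square(s):
    """E[exp(X^2 / s^2)] for X ~ N(0, 1); infinite unless s^2 > 2."""
    if s * s <= 2.0:
        return math.inf
    return 1.0 / math.sqrt(1.0 - 2.0 / (s * s))


def psi2_norm_standard_normal(tol=1e-12):
    """Smallest s with E[exp(X^2 / s^2)] <= 2, found by bisection."""
    lo, hi = math.sqrt(2.0), 10.0  # expectation is infinite at lo, below 2 at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gaussian_mgf_of_square(mid) <= 2.0:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    print(psi2_norm_standard_normal(), math.sqrt(8.0 / 3.0))
```

The bisection is valid because the expectation is decreasing in $s$ on $(\sqrt{2}, \infty)$.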

Let $T$ be a bounded and countable subset of $\mathbb{R}^n$.

Theorem C.1. Let $e_1, \dots, e_n$ be independent random variables such that $\max_{1 \le i \le n} \|e_i\|_{\psi_2} \le C_\psi$ for some constant $C_\psi$. Put $Z := \sup_{t \in T} \sum_{i=1}^n e_i t_i$. Then, for all $r \ge 4$, we have

$$P\{Z \ge E[Z] + r\sigma C_\psi\} \le C\exp(-cr^2/\log n),$$

where $\sigma := \sup_{t \in T} \sqrt{\sum_{i=1}^n t_i^2}$, and $c > 0$ and $C > 0$ are universal constants.

¹ Koltchinskii and Yuan (2010) and Raskutti et al. (2010) dealt with a different estimator and
established the rate $s^*\delta^2$ under different settings. Koltchinskii and Yuan (2010) assumed that the
error term is uniformly bounded, and Raskutti et al. (2010) assumed that the error term is normal
independent of explanatory variables.
Theorem C.1 is essentially proved in Mendelson and Tomczak-Jaegermann (2008)
(but the exponential term is slightly worse in Mendelson and Tomczak-Jaegermann
(2008): $\exp(-cr^2/\log^2 n)$). For the sake of completeness, we provide a proof of the
theorem. The proof of this theorem uses some properties of the $\psi_1$-norm.

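Lemma C.1. (Ledoux and Talagrand, 1991, Theorem 6.21) Let $\zeta_1, \dots, \zeta_n$ be independent centered random variables. Then,

$$\Big\|\sum_{i=1}^n \zeta_i\Big\|_{\psi_1} \le C\Big\{E\Big[\Big|\sum_{i=1}^n \zeta_i\Big|\Big] + \Big\|\max_{1 \le i \le n} |\zeta_i|\Big\|_{\psi_1}\Big\},$$

where $C$ is a universal constant.

For the evaluation of the term $\|\max_{1 \le i \le n} |\zeta_i|\|_{\psi_1}$, we use the next lemma.

Lemma C.2. (van der Vaart and Wellner, 1996, Lemma 2.2.2) Let $\zeta_1, \dots, \zeta_n$ be any random variables. Then,

$$\Big\|\max_{1 \le i \le n} |\zeta_i|\Big\|_{\psi_1} \le C(\log n) \max_{1 \le i \le n} \|\zeta_i\|_{\psi_1},$$

where $C$ is a universal constant.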

Proof of Theorem C.1. In this proof, $c$ and $C$ denote some universal constants. Their values may change from line to line. Let $e_i^- := e_i 1(|e_i| \le L)$ and $e_i^+ := e_i 1(|e_i| > L)$. The constant $L > 0$ is defined later. Define $\bar T := T \cup \{-t : t \in T\}$. Let $Z^- := \sup_{t \in T} \sum_{i=1}^n e_i^- t_i$ and $Z^+ := \sup_{t \in \bar T} \sum_{i=1}^n e_i^+ t_i$. Clearly, $Z \le Z^- + Z^+$, and by using that $\sum_{i=1}^n e_i^- t_i = \sum_{i=1}^n (e_i - e_i^+) t_i = \sum_{i=1}^n e_i t_i + \sum_{i=1}^n e_i^+(-t_i)$, we have $E[Z^-] \le E[Z] + E[Z^+]$, so that $E[Z] \ge E[Z^-] - E[Z^+]$. Observe that

$$P\{Z \ge E[Z] + r\sigma C_\psi\} \le P\{Z^- + Z^+ \ge E[Z^-] - E[Z^+] + r\sigma C_\psi\} \le P\{Z^- \ge E[Z^-] + r\sigma C_\psi/2\} + P\{Z^+ + E[Z^+] \ge r\sigma C_\psi/2\}.$$
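The truncation step can be illustrated on a finite index set. The sketch below is only an illustration under stated assumptions: it takes the truncated-part supremum over the original index set and the tail-part supremum over the symmetrized set (the index set together with its negatives), and checks the two pathwise inequalities that such a split yields, namely that the full supremum is at most the sum of the two parts and that the truncated supremum is at most the full supremum plus the tail part. All names are hypothetical.

```python
import random


def sup_linear(values, index_set):
    """Maximum over t in a finite index_set of sum_i values[i] * t[i]."""
    return max(sum(v * ti for v, ti in zip(values, t)) for t in index_set)


def truncation_split(e, level):
    """Split e into e * 1(|e| <= level) and e * 1(|e| > level), coordinate-wise."""
    e_minus = [x if abs(x) <= level else 0.0 for x in e]
    e_plus = [x if abs(x) > level else 0.0 for x in e]
    return e_minus, e_plus


def z_statistics(e, index_set, level):
    """Return (Z, Z_minus, Z_plus); Z_plus is taken over the symmetrized set."""
    e_minus, e_plus = truncation_split(e, level)
    symmetrized = list(index_set) + [[-ti for ti in t] for t in index_set]
    return (
        sup_linear(e, index_set),
        sup_linear(e_minus, index_set),
        sup_linear(e_plus, symmetrized),
    )


if __name__ == "__main__":
    random.seed(1)
    n = 6
    index_set = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(20)]
    e = [random.gauss(0, 1) for _ in range(n)]
    z, z_minus, z_plus = z_statistics(e, index_set, level=1.0)
    print(z <= z_minus + z_plus + 1e-12, z_minus <= z + z_plus + 1e-12)
```

Taking expectations in the second inequality gives the lower bound on the mean of the full supremum that is used to split the tail probability into a bounded part and a heavy-tail part.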

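Because $|e_i^-| \le 2L$, by Corollary 4.8 of Ledoux (2001), we have

$$P\{Z^- \ge E[Z^-] + r\sigma C_\psi/2\} \le C\exp(-cr^2 C_\psi^2/L^2), \quad \forall r > 0.$$

On the other hand, it is standard to see that $\|(e_i^+)^2\|_{\psi_1} = \|e_i^+\|_{\psi_2}^2 \le C_\psi^2$ and

$$E\Big[\sum_{i=1}^n (e_i^+)^2\Big] \le n \max_{1 \le i \le n} E[e_i^2 1(|e_i| > L)] \le n \max_{1 \le i \le n} E[e_i^4]^{1/2} P(|e_i| > L)^{1/2} \le nCC_\psi^2 \exp(-cL^2/C_\psi^2).$$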
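Thus, by Lemmas C.1 and C.2,

$$\Big\|\sum_{i=1}^n (e_i^+)^2\Big\|_{\psi_1} \le \Big\|\sum_{i=1}^n \{(e_i^+)^2 - E[(e_i^+)^2]\}\Big\|_{\psi_1} + C\sum_{i=1}^n E[(e_i^+)^2] \le CC_\psi^2\{n\exp(-cL^2/C_\psi^2) + \log n\}.$$

Take $L = CC_\psi\sqrt{\log n}$ such that $E[\sum_{i=1}^n (e_i^+)^2] \le C_\psi^2$ and $\|\sum_{i=1}^n (e_i^+)^2\|_{\psi_1} \le CC_\psi^2 \log n$. Because $Z^+ \le \sigma\{\sum_{i=1}^n (e_i^+)^2\}^{1/2}$, we have

$$\|Z^+\|_{\psi_2} \le C\sigma C_\psi\sqrt{\log n}, \qquad E[Z^+] \le \sigma C_\psi.$$

Therefore, for all $r \ge 4$, we have

$$P\{Z^+ + E[Z^+] \ge r\sigma C_\psi/2\} \le P\{Z^+ \ge r\sigma C_\psi/4\} \le C\exp(-cr^2/\log n).$$

This completes the proof. □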